Abstract. We introduce weak transport costs that are weakened forms of
the transport costs defined by Marton in [26]. We obtain new weak transport
inequalities for non-product measures, similar to those obtained by Samson
in [32] but valid also for metrics other than the Hamming distance. Many
examples are provided to show that the Euclidean norm is an appropriate metric
for many classical time series. The dual form of the weak transport inequalities
yields new exponential inequalities and extensions to the dependent case of the
classical result of Talagrand [33] for convex functions that are Lipschitz
continuous. Expressing the concentration properties of the ordinary least squares
estimator as a conditional mass transport problem, we derive from the weak
transport inequalities new oracle inequalities with fast rates of convergence.

1. Introduction

Since the seminal work of Marton [24], transport inequalities efficiently yield
dimension free concentration inequalities. Using a duality argument, Bobkov and
Götze [6] even proved that transport inequalities are equivalent to some
concentration inequalities. Our references on the subject are the monograph of
Villani [36] and the survey of Gozlan and Léonard [16]. Transport inequalities
appear as a nice alternative to the classical modified log-Sobolev approach of
Massart [58] for obtaining dimension free concentration inequalities useful in
mathematical statistics. More specifically, dimension free concentration
inequalities are used to get oracle inequalities with fast rates of convergence.
This article develops new kinds of transport inequalities, new exponential
inequalities and new oracle inequalities with fast rates of convergence.

In the case of product measures, the classical modified log-Sobolev approach
developed by Massart in [58] leads to optimal dimension free concentration
inequalities of Bernstein's type. However, for non-product measures, such
inequalities do not hold in their optimal form in many situations. The reason is
the following: in the bounded iid case, Bernstein's inequality yields Gaussian
behavior for deviations less than a bound depending on the essential supremum.
In many bounded dependent cases, there exists a unique regeneration scheme of
iid cycles with random length. Bernstein's inequality then yields Gaussian
behavior for small deviations less than a bound depending on the essential
supremum and also on the concentration properties of the random length, see
Bertail and Clémençon [5]. This is a drawback for statistical applications,
where the variance term, which is essential, is perturbed by the concentration
properties of the random length. It leads to an additional term, at least
logarithmic, which cannot be removed, see Adamczak pQ. To bypass this problem,
many authors assumed contraction conditions on the conditional measure, see
Marton [52] for the total variation metric and Lezaud [53] under a spectral gap
condition for the kernel of a Markov chain. For symmetric Markov processes, this
second condition is more general and it is also necessary for Bernstein's
inequality, see Guillin et al. [17].

Many classical models in time series analysis do not satisfy such conditions.
Fortunately, the classical Bernstein's inequality also holds without contraction
conditions but under $\gamma$-weak dependence conditions, closely related to
uniform mixing conditions, see Samson [32]. This result yields fast convergence
rates of order $n^{-1}$ in oracle inequalities (comparable to those in the iid
case) in a dependent setting, see [5]. However, this approach relies on the
maximal coupling properties of the Hamming distance and cannot be extended to
other metrics, see [TT]. For other metrics, non-optimal couplings are used by
Marton [57] and Djellout et al. [12] to extend the classical dimension free
transport inequalities $T_2(C)$ to a dependent context. If the "constant" $C$ in
the transport inequality is sufficiently close to the variance term then
Bernstein's inequality is recovered and fast convergence rates are achieved,
see Joulin and Ollivier [T§]. Otherwise, the statistical convergence rates are
slower than $n^{-1}$ because a tradeoff must be made between the estimate of the
variance and the accuracy of coupling schemes that are not dimension free, see
Wintenberger [37] for details. The fast rates of convergence in mathematical
statistics are not achieved in general dependent contexts due to the variance
term appearing in the classical dimension free inequalities of Bernstein's type.
On the contrary, Hoeffding's inequality, which does not have a variance term, is
easily extended to very general dependent cases, see Rio [30] and Djellout et
al. [12]. Unfortunately, Hoeffding's inequality, equivalent to the $T_1(C)$
transport inequality, is not dimension free. Thus, this probabilistic inequality
yields slow rates of convergence of order $n^{-1/2}$, see Alquier and
Wintenberger [3].

In this paper we develop new probabilistic tools to obtain dimension free
exponential inequalities and thus fast convergence rates in oracle inequalities.
Let $(E, d)$ be a Polish space. With the notation $P[h] = \int h\,dP$ for any
probability measure $P$ and any measurable function $h$, we say that $P$
satisfies the new transport inequality $\widetilde T_p(C)$ for $C > 0$ and
$1 \le p \le 2$ if for any measure $Q$

\[ \sup_{\alpha} \inf_{\pi} \frac{\pi[\alpha(Y) d(X,Y)]}{(Q[\alpha(Y)^q])^{1/q}} \le \sqrt{2C\,\mathcal K(P|Q)} \]

with $1/p + 1/q = 1$ and the convention $+\infty/+\infty = 0/0 = 0$. Here
$\alpha$ is any non-negative measurable function, $\pi$ is any coupling scheme
of $(X, Y)$ with margins $(P, Q)$ and $\mathcal K(P|Q)$ is the relative entropy
$Q[\log(dQ/dP)]$ (also called the Kullback-Leibler divergence). As the roles of
$P$ and $Q$ are not the same, we also introduce $\widetilde T_p^{(i)}(C)$ where
$P$ and $Q$ are interchanged in the left hand side term. These inequalities are
weakened versions of the transport inequalities introduced by Marton [26]
(for $d$ being the Hamming distance)

\[ \inf_{\pi} \sup_{\alpha \ge 0} \frac{\pi[\alpha(Y) d(X,Y)]}{(Q[\alpha(Y)^q])^{1/q}} \le \sqrt{2C\,\mathcal K(P|Q)}. \]

These inequalities are already weakened forms of the classical $T_p(C)$
transport inequality

\[ \inf_{\pi} \pi[d^p(X,Y)]^{1/p} \le \sqrt{2C\,\mathcal K(P|Q)}. \]

Contrary to the classical $T_p(C)$ transport inequalities, any compactly
supported measure $P$ satisfies the weak $\widetilde T_p(C)$ transport
inequalities for any $1 \le p \le 2$. Moreover, the weak transport inequalities
extend nicely to non-product non-contractive measures $P$ on $E^n$, $n \ge 1$.
Using a new Markov coupling scheme, our main result in Theorem 3.2 states that
there exists $C > 0$ such that

\[ \sup_{\alpha} \inf_{\pi} \frac{\sum_{j=1}^n \pi[\alpha_j(Y) d(X_j,Y_j)]}{\big(\sum_{j=1}^n Q[\alpha_j(Y)^q]\big)^{1/q}} \le \sqrt{2C\,n^{2/p-1}\,\mathcal K(P|Q)} \tag{1.1} \]

when the conditional measures $P_{X_i|x^{(i-1)}}$, with
$x^{(i)} = (x_i, \dots, x_0)$, satisfy the weak transport inequalities
$\widetilde T_p(C)$ and under a new $\gamma(p)$-weak dependence condition:

\[ W_p(P_{X_k|x^{(i)}}, P_{X_k|x^{(i-1)},y_i}) \le \gamma_{k,i}(p)\, d(x_i,y_i), \qquad 0 < i < k \le n. \]

When $d$ is the Hamming distance, the $\gamma(2)$-weak dependence coincides with
the weak dependence context already studied by Samson [32] and we obtain similar
results. We keep the notation and denote $\tilde\gamma(p)$ the weak dependence
coefficients when $d$ is the Hamming distance. However, to tackle much more
general and classical time series contexts, we prefer to choose $d$ as the
Euclidean norm, see Section 4. Then, when $p = 1$ and
$\widetilde T_1(C) = \widetilde T_1^{(i)}(C) = T_1(C)$ by definition, the
$\gamma(1)$-weak dependence is linked with the weak dependence notion introduced
by Rio in [30] as discussed in Djellout et al. [12]. Thus we recover the
Hoeffding inequality of [30], which is not dimension free because
$n^{2/p-1} = n$ when $p = 1$.

The dual forms of the weak transport inequalities yield new exponential
inequalities. Except in the specific case of the Hamming distance, the
deviations are not estimated in terms of the variance and, contrary to
Hoeffding's inequality, the resulting inequality is dimension free when
$p = 2$. If $P$ satisfies $\widetilde T_2(C)$ on $E^n$ then for any






OLIVIER WINTENBERGER 



function $f$ of the observations $(X_1, \dots, X_n)$ such that there exist
functions $L_j(x)$ satisfying $f(x) - f(y) \le \sum_{j=1}^n L_j(x) d(x_j, y_j)$
for any $x, y \in (\mathbb R^d)^n$ we have

\[ P\Big[\exp\Big(\lambda(f - P[f]) - \frac{C\lambda^2}{2}\sum_{j=1}^n L_j^2\Big)\Big] \le 1, \qquad \lambda > 0. \tag{1.2} \]

When $d$ is the Hamming distance, inequality (1.2) yields the classical
Bernstein's inequality, see Ledoux [35] in the independent setting and Samson
[32] in the uniform mixing setting. When the function $f$ is convex, it
satisfies the above condition with $L_j = \partial_j f$, its sub-gradient, and
the inequality (1.2) coincides with generalizations of the Tsirel'son
inequality of [3j] (also implied by the $T_2$ transport inequality, see Bobkov
et al. [7]). For convex functions that are also Lipschitz continuous, the
inequality stated above leads to new extensions of the classical exponential
inequality due to Talagrand [33] for product measures.

As $n^{2/p-1} = 1$ for $p = 2$, combining inequalities (1.1) and (1.2) we obtain
new dimension free exponential inequalities for many classical dependent time
series that are $\gamma(2)$-weakly dependent. As the transport inequalities
yield concentration of measure via relative entropy, we couple them with the
statistical PAC-Bayesian paradigm, which also describes the accuracy of
estimators in terms of relative entropy, see McAllester [29]. The oracle
inequalities can thus be expressed as a conditional mass transport problem. We
apply this new approach to the Ordinary Least Squares (OLS) estimator
$\hat\theta$ in the linear regression context (other interesting statistical
issues will be investigated in the future). Denoting by $R$ the risk of
prediction, an oracle inequality states with high probability that
$R(\hat\theta) \le (1 + \eta) R(\bar\theta) + R_n \eta^{-1}$ where
$\eta \ge 0$, $\bar\theta$ is the oracle defined by
$R(\bar\theta) \le R(\theta)$ for all $\theta$, and $R_n$ is the rate of
convergence. Oracle inequalities are standard non-asymptotic criteria for the
efficiency of statistical estimators, see Massart [35]. If $\eta = 0$ then the
oracle inequality is said to be exact and otherwise it is non-exact, see Lecué
and Mendelson [2"T] for a discussion. The dimension free concentration
properties yield fast rates of convergence $R_n \propto n^{-1}$. For
$\gamma(2)$-weakly dependent time series, we obtain new non-exact oracle
inequalities for the OLS estimator $\hat\theta$ when the conditional measures
satisfy the weak transport inequalities. These assumptions are satisfied for
many models such as classical ARMA models with bounded, Gaussian or log-concave
innovations. In the specific case when $d$ is the Hamming distance, we recover
in the conditional mass transport problem the classical variance term, as was
the case for the exponential inequalities. This variance term plays a crucial
role through the so-called necessary margin condition introduced by Tsybakov
[33]. Thus, fixing $d$ as the Hamming distance, we obtain new exact oracle
inequalities with fast convergence rates for the OLS estimator $\hat\theta$ in
the $\gamma(2)$-weakly dependent case.

The paper is organized as follows. Section 2 develops the properties of the
weak transport costs used in the proof of our main result, a weak transport
inequality for non-product measures stated in Section 3. Section 4 is devoted
to some examples. The dual form of the weak transport inequalities yields new
exponential inequalities presented in Section 5. Finally, new oracle
inequalities with fast rates of convergence are given in Section 6.

2. Weak transport costs, gluing lemma and Markov couplings 

2.1. Weak transport costs on E. Let $M(F)$ denote the set of probability
measures on some space $F$, $M^+(F)$ the set of lower semi-continuous
non-negative measurable functions on $F$ and $M(P, Q)$ the set of coupling
measures $\pi_{x,y}$, i.e. $\pi_{x,y} \in M(E^2)$ with margins $\pi_x = P$ and
$\pi_y = Q$. Let $(p, q)$ be real numbers satisfying $1 \le p \le 2$ and
$1/p + 1/q = 1$. Let us define the weak transport cost as

\[ \widetilde W_p(P,Q) = \sup_{\alpha \in M^+(E)} \inf_{\pi \in M(P,Q)} \frac{\pi[\alpha(Y) d(X,Y)]}{Q[\alpha^q]^{1/q}} \tag{2.1} \]

with the classical conventions $Q[\alpha^q]^{1/q} = \operatorname{ess\,sup} \alpha(Y)$
when $q = \infty$ and $+\infty/+\infty = 0/0 = 0$. For fixed
$\alpha \in M^+(E)$, let us denote

\[ \widetilde W_\alpha(P,Q) = \inf_{\pi \in M(P,Q)} \pi[\alpha(Y) d(X,Y)]. \tag{2.2} \]

Notice that $\widetilde W_p$ is not symmetric and that
$\widetilde W_p(P,Q) = \widetilde W_p(Q,P) = \widetilde W_\alpha(P,Q) = \widetilde W_\alpha(Q,P) = 0$
if $P = Q$. Notice that $\alpha \in M^+$ is assumed lower semi-continuous so
that the optimal transport in the weak transport costs exists, see for example
[TE]. Now let us show that the weak transport cost satisfies the triangular
inequality. It is a simple consequence of the second assertion of the following
version of the gluing Lemma:

Lemma 2.1. For any couplings $\pi_{x,y} \in M(P,Q)$ and $\pi_{y,z} \in M(Q,R)$
there exists a distribution $\pi_{x,y,z}$ with corresponding margins and such
that $X$ and $Z$ are independent conditionally on $Y$, i.e.
$\pi_{x,z|y} = \pi_{x|y}\pi_{z|y}$.

Proof. From the classical gluing Lemma, see for example Villani's textbook
[36], we can choose $\pi_{x,y,z}$ such that
$\pi_{x,y,z} = \pi_{x|y}\pi_{z|y}\pi_y$ as the margins correspond:
$\pi_{x|y}\pi_y = \pi_{x,y}$ and $\pi_{z|y}\pi_y = \pi_{y,z}$. The conditional
independence follows from the specific form of $\pi_{x,y,z}$ as
$\pi_{x,z|y} = \pi_{x,y,z}/\pi_y$ by definition. □
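On a finite space the gluing construction can be written out explicitly. The sketch below is only an illustration under stated assumptions: the couplings are stored as Python dictionaries, and the names `glue` and `margin` as well as the toy masses are hypothetical, not taken from the paper. It builds $\pi_{x,y,z} = \pi_{x|y}\pi_{z|y}\pi_y$ and recovers both prescribed two-dimensional margins.

```python
def glue(pi_xy, pi_yz):
    """Glue two finite couplings {(x, y): mass} and {(y, z): mass}.

    Returns pi_xyz = pi_{x|y} * pi_{z|y} * pi_y as {(x, y, z): mass}.
    Both inputs are assumed to share the same Y-margin.
    """
    # Y-margin, computed from the first coupling.
    pi_y = {}
    for (x, y), mass in pi_xy.items():
        pi_y[y] = pi_y.get(y, 0.0) + mass
    glued = {}
    for (x, y), m_xy in pi_xy.items():
        for (y2, z), m_yz in pi_yz.items():
            if y2 == y and pi_y[y] > 0:
                # pi_{x|y} * pi_{z|y} * pi_y = pi_{x,y} * pi_{y,z} / pi_y
                glued[(x, y, z)] = m_xy * m_yz / pi_y[y]
    return glued


def margin(pi, keep):
    """Marginalise a joint law onto the coordinate positions listed in keep."""
    out = {}
    for key, mass in pi.items():
        sub = tuple(key[i] for i in keep)
        out[sub] = out.get(sub, 0.0) + mass
    return out


# Toy couplings on {0, 1}; the Y-margin is (0.5, 0.5) in both.
pi_xy = {(0, 0): 0.3, (1, 0): 0.2, (0, 1): 0.1, (1, 1): 0.4}
pi_yz = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.05, (1, 1): 0.45}
pi_xyz = glue(pi_xy, pi_yz)
```

By construction the mass of $(x, y, z)$ factorises as $\pi_{x,y}\pi_{y,z}/\pi_y$, which is exactly the conditional independence of $X$ and $Z$ given $Y$ used in the sequel.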

The conditional independence in the gluing Lemma 2.1 is the main ingredient to
prove the triangular inequality on $\widetilde W_p$:

Lemma 2.2. For any $P, Q, R$ we have

\[ \widetilde W_p(P,R) \le \widetilde W_p(P,Q) + \widetilde W_p(Q,R). \tag{2.3} \]

Proof. Let us fix $\alpha \in M^+(E)$ such that $R[\alpha^q] < \infty$. We have

\[ \pi_{x,z}[\alpha(Z) d(X,Z)] \le \pi[\alpha(Z) d(X,Y)] + \pi_{y,z}[\alpha(Z) d(Y,Z)]. \]

Let us choose $\pi^*_{y,z}$ satisfying

\[ \pi^*_{y,z}[\alpha(Z) d(Y,Z)] = \inf_{\pi \in M(Q,R)} \pi[\alpha(Z) d(Y,Z)] \le R[\alpha^q]^{1/q}\, \widetilde W_p(Q,R). \]

By conditional independence in Lemma 2.1 we also have

\[ \pi[\alpha(Z) d(X,Y)] = \pi_{x,y}\big[\pi^*_{z|y}[\alpha(Z)|Y]\, d(X,Y)\big] =: \pi_{x,y}[\tilde\alpha(Y) d(X,Y)]. \]

Let us choose $\pi^*_{x,y}$ satisfying

\[ \pi^*_{x,y}[\tilde\alpha(Y) d(X,Y)] = \inf_{\pi \in M(P,Q)} \pi[\tilde\alpha(Y) d(X,Y)] \le Q[\tilde\alpha^q]^{1/q}\, \widetilde W_p(P,Q). \]

Notice that $Q[\tilde\alpha^q] = Q[\pi^*_{z|y}[\alpha(Z)|Y]^q] \le R[\alpha^q]$
using Jensen's inequality. Let us denote $\pi^* = \pi^*_{x,y,z}$ the
distribution obtained by the gluing Lemma 2.1 from $\pi^*_{x,y}$ and
$\pi^*_{y,z}$. Collecting all these bounds we have
$\pi^*[\alpha(Z) d(X,Y)] \le R[\alpha^q]^{1/q}\, \widetilde W_p(P,Q)$. We obtain

\[ \frac{\pi^*[\alpha(Z) d(X,Z)]}{R[\alpha^q]^{1/q}} \le \widetilde W_p(P,Q) + \widetilde W_p(Q,R) \]

and taking the supremum over $\alpha$ the desired result follows from the
definition of $\widetilde W_p(P,R)$. □

2.2. Markov couplings. In this section, we only consider Markov couplings on
the product space $E^n$ with $n = 2$, the cases $n > 2$ following by a simple
induction.

Definition 2.1. Let $P, Q \in M(E^2)$. The set of Markov couplings $M(P,Q)$ is
defined as the set of products $\pi = \pi_1 \pi_{2|1}$ with $\pi_1$ a coupling
of $P_1$ and $Q_1$ and $\pi_{2|1}$ a coupling of $P_{2|1}$ and $Q_{2|1}$.

The terminology of Markov couplings was introduced by Rüschendorf in [31].
Similar couplings have been used by Marton in [26]. The property of conditional
independence in the gluing Lemma 2.1 is nicely compatible with Markov couplings:

Lemma 2.3. For any Markov couplings $\pi_{x,y} \in M(P,Q)$ and
$\pi_{y,z} \in M(Q,R)$ with $P, Q, R \in M(E^2)$ there exists a distribution
$\pi_{x,y,z}$ with corresponding margins and such that $X = (X_1, X_2)$ and
$Z = (Z_1, Z_2)$ are independent conditionally on $Y = (Y_1, Y_2)$.

Proof. By assumption $\pi_{x,y} = \pi_{x_1,y_1}\pi_{x_2,y_2|x_1,y_1}$ and
$\pi_{y,z} = \pi_{y_1,z_1}\pi_{y_2,z_2|y_1,z_1}$. Let us define $\pi_{x,y,z}$ as
$\pi_{x_1,y_1,z_1}\pi_{x_2,y_2,z_2|x_1,y_1,z_1}$ by the relations

\[ \pi_{x_1,y_1,z_1} = \pi_{x_1|y_1}\pi_{z_1|y_1}\pi_{y_1} \tag{2.4} \]

and

\[ \pi_{x_2,y_2,z_2|x_1,y_1,z_1} = \pi_{x_2|x_1,y_1,y_2}\,\pi_{z_2|y_1,z_1,y_2}\,\pi_{y_2|y_1}. \tag{2.5} \]

Let us check that $\pi_{x,y,z}$ has the correct margins. First, from the
classical gluing lemma we know that $\pi_{x_1,y_1,z_1}$ has the correct margins.
It remains to prove that $\pi_{x_2,y_2,z_2|x_1,y_1,z_1}$ has the correct
margins. Notice that from the definition of Markov couplings, we have
$\pi_{y_2|y_1} = \pi_{y_2|x_1,y_1} = \pi_{y_2|y_1,z_1}$. Thus the first margin
of $\pi_{x_2,y_2,z_2|x_1,y_1,z_1}$ is equal to

\[ \pi_{x_2|x_1,y_1,y_2}\,\pi_{y_2|y_1} = \pi_{x_2|x_1,y_1,y_2}\,\pi_{y_2|x_1,y_1} = \pi_{x_2,y_2|x_1,y_1}. \]

The same reasoning shows that the second margin is also the correct one.

We proved above that by construction $X_1$ and $Z_1$ are independent
conditionally on $Y_1$, i.e. that
$\pi_{x_1,z_1|y_1} = \pi_{x_1|y_1}\pi_{z_1|y_1}$. Let us show that it is also
the case conditionally on $Y_1$ and $Y_2$. We have

\[ \pi_{x_1,z_1|y_1,y_2} = \frac{\pi_{x_1,z_1,y_1,y_2}}{\pi_{y_1,y_2}} = \frac{\pi_{y_2|y_1}\,\pi_{x_1,z_1,y_1}}{\pi_{y_2|y_1}\,\pi_{y_1}} = \pi_{x_1,z_1|y_1}, \]

the second identity following from the identity
$\pi_{y_2|y_1} = \pi_{y_2|x_1,y_1,z_1}$ given by (2.5). Thus, using that $X_1$
and $Z_1$ are independent conditionally on $Y_1$ we obtain the identity
$\pi_{x_1,z_1|y_1,y_2} = \pi_{x_1|y_1}\pi_{z_1|y_1}$. We conclude that
$\pi_{x_1,z_1|y_1,y_2} = \pi_{x_1|y_1,y_2}\,\pi_{z_1|y_1,y_2}$ as

\[ \pi_{x_1|y_1} = \frac{\pi_{y_2|y_1}\,\pi_{x_1,y_1}}{\pi_{y_2|y_1}\,\pi_{y_1}} = \frac{\pi_{x_1,y_1,y_2}}{\pi_{y_1,y_2}} = \pi_{x_1|y_1,y_2}, \]

the second identity following from the identity
$\pi_{y_2|y_1} = \pi_{y_2|x_1,y_1}$ by definition of Markov couplings (the same
is true replacing $x_1$ by $z_1$).

It remains to prove that $X_2$ is independent of $Z_2$ conditionally on
$(X_1, Z_1)$ and $(Y_1, Y_2)$. Indeed, we have by construction

\[ \pi_{x_2,z_2|x_1,y_1,z_1,y_2} = \frac{\pi_{x_2,y_2,z_2|x_1,y_1,z_1}}{\pi_{y_2|x_1,y_1,z_1}} = \frac{\pi_{x_2,y_2,z_2|x_1,y_1,z_1}}{\pi_{y_2|y_1}} = \pi_{x_2|x_1,y_1,y_2}\,\pi_{z_2|y_1,z_1,y_2}, \]
the last identity following from the identity (2.5). Thus the result is
proved. □

2.3. Weak transport costs on $E^n$, $n \ge 2$. We extend the definition of
$\widetilde W$ to the product space $E^n$ for $n \ge 2$. For
$P, Q \in M(E^n)$ we define

\[ \widetilde W_p(P,Q) = \sup_{\alpha \in M^+(E^n)} \inf_{\pi \in M(P,Q)} \frac{\sum_{j=1}^n \pi[\alpha_j(Y) d(X_j,Y_j)]}{\big(\sum_{j=1}^n Q[\alpha_j(Y)^q]\big)^{1/q}} \tag{2.6} \]

with the convention
$\big(\sum_{j=1}^n Q[\alpha_j(Y)^q]\big)^{1/q} = \max_{1 \le j \le n} \operatorname{ess\,sup} \alpha_j$
if $q = \infty$ and

\[ \widetilde W_\alpha(P,Q) = \inf_{\pi \in M(P,Q)} \sum_{j=1}^n \pi[\alpha_j(Y) d(X_j,Y_j)] \tag{2.7} \]

for any fixed $\alpha = (\alpha_j)_{1 \le j \le n} \in M^+(E^n)$. Considering
Markov couplings, we can use the conditional independence in the gluing Lemma
2.3 to assert that the weak transport cost on $E^n$ also satisfies the
triangular inequality. More usefully, $\widetilde W_\alpha$ satisfies an
inequality similar to the triangular one:

Lemma 2.4. For any $P, Q, R \in M(E^n)$ and any $\alpha \in M^+(E^n)$ there
exists $\tilde\alpha \in M^+(E^n)$ satisfying
$Q[\tilde\alpha_j(Y)^q] \le R[\alpha_j(Z)^q]$ for any $1 \le j \le n$ and

\[ \widetilde W_\alpha(P,R) \le \widetilde W_{\tilde\alpha}(P,Q) + \widetilde W_\alpha(Q,R). \tag{2.8} \]

Remark 2.1. As a consequence of Lemma 2.4 we obtain the triangular inequality
for $\widetilde W_p$

\[ \widetilde W_p(P,R) \le \widetilde W_p(P,Q) + \widetilde W_p(Q,R) \tag{2.9} \]

by taking the supremum over $\alpha$ on both sides of (2.8) and using the
relation $Q[\tilde\alpha_j(Y)^q] \le R[\alpha_j(Z)^q]$.

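Proof. Let us fix $\alpha \in M^+(E^n)$ such that $R[\alpha_j^q] < \infty$ for
all $1 \le j \le n$. Define recursively the Markov couplings $\pi^*_{y,z}$ and
$\pi^*_{x,y}$ such that

\[ \pi^*_{y,z}\Big[\sum_{j=1}^n \alpha_j(Z) d(Y_j,Z_j)\Big] = \widetilde W_\alpha(Q,R), \]

\[ \pi^*_{x,y}\Big[\sum_{j=1}^n \pi^*_{z|y}[\alpha_j(Z)|Y]\, d(X_j,Y_j)\Big] = \widetilde W_{\pi^*_{z|y}[\alpha(Z)|Y]}(P,Q), \]

where we use Jensen's inequality to ensure that
$Q[\pi^*_{z|y}[\alpha_j(Z)|Y]^q] \le R[\alpha_j^q] < \infty$. Let us denote
$\pi^* = \pi^*_{x,y,z}$ the distribution obtained by the gluing Lemma 2.3 from
$\pi^*_{x,y}$ and $\pi^*_{y,z}$. Then

\[ \pi^*\Big[\sum_{j=1}^n \alpha_j(Z) d(X_j,Z_j)\Big] \le \pi^*\Big[\sum_{j=1}^n \alpha_j(Z) d(X_j,Y_j)\Big] + \pi^*\Big[\sum_{j=1}^n \alpha_j(Z) d(Y_j,Z_j)\Big] \]

\[ = \pi^*_{x,y}\Big[\sum_{j=1}^n \pi^*_{z|y}[\alpha_j(Z)|Y]\, d(X_j,Y_j)\Big] + \pi^*_{y,z}\Big[\sum_{j=1}^n \alpha_j(Z) d(Y_j,Z_j)\Big] \]

\[ \le \widetilde W_{\pi^*_{z|y}[\alpha(Z)|Y]}(P,Q) + \widetilde W_\alpha(Q,R). \tag{2.10} \]

The inequality (2.8) follows from (2.10) taking
$\tilde\alpha_j = \pi^*_{z|y}[\alpha_j(Z)|Y = \cdot\,]$ and noticing that the
relation $Q[\tilde\alpha_j^q(Y)] \le R[\alpha_j^q(Z)]$ holds by an application of Jensen's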
inequality. □

3. Weak transport inequalities

3.1. Weak transport inequalities. Let us denote $\mathcal K(P|Q)$ the relative
entropy (or the Kullback-Leibler divergence) defined as
$\mathcal K(P|Q) = Q[\log(dQ/dP)]$ when $Q \ll P$ and
$\mathcal K(P|Q) = \infty$ otherwise. Let us say that the probability measure
$P$ on $E^n$ satisfies the weak transport inequality $\widetilde T_p(C)$ when
for all distributions $Q$ on $E^n$ we have

\[ \widetilde W_p(P,Q) \le \sqrt{2C\,\mathcal K(P|Q)}. \tag{3.1} \]

Let us say that $P$ satisfies the inverted weak transport inequality
$\widetilde T_p^{(i)}(C)$ when

\[ \widetilde W_p(Q,P) \le \sqrt{2C\,\mathcal K(P|Q)}. \tag{3.2} \]

Notice that by definition and by Jensen's inequality $P$ satisfies
$\widetilde T_p(C)$ and $\widetilde T_p^{(i)}(C)$ as soon as it satisfies
$\widetilde T_{p'}(C)$ and $\widetilde T_{p'}^{(i)}(C)$ respectively with
$p' \ge p$. Moreover
$\widetilde T_1(C) = \widetilde T_1^{(i)}(C) = T_1(C)$ where $T_p(C)$ is the
classical transport inequality defined for any $1 \le p \le 2$ as

\[ \inf_{\pi \in M(P,Q)} \pi[d^p(X,Y)]^{1/p} \le \sqrt{2C\,\mathcal K(P|Q)}. \]

3.2. Weak transport inequalities on E. Let us consider in this section a
probability measure $P$ on $E$ (case $n = 1$). Let us show the following

Theorem 3.1.
(1) Any $P \in M(E)$ satisfies $\widetilde T_2(1)$ and
$\widetilde T_2^{(i)}(1)$ when $d$ is the Hamming distance
$d(x,y) = 1_{x \ne y}$.
(2) Any $P \in M(E)$ satisfies $\widetilde T_2(D^2)$ and
$\widetilde T_2^{(i)}(D^2)$ for any metric $d$ such that
$\sup_{(x,y) \in E^2} d(x,y) =: D < \infty$.

Remark 3.1. Below is the proof of point (1) for the sake of completeness.
However, it is a direct consequence of Theorem 2 in Marton [26], with an
alternative proof in Samson [32], as their transport costs are stronger than
ours. The constant 1 is still optimal for our weaker transport inequality, see
the discussion in Section 5.3.

Remark 3.2. By definition every $P$ satisfying the classical transport
inequality $T_2(C)$, such as Gaussian or log-concave measures, also satisfies
$\widetilde T_2(C)$ and $\widetilde T_2^{(i)}(C)$. However, any distribution
having a support with finite diameter satisfies $\widetilde T_2(C)$ by point
(2) of the above Theorem but not necessarily $T_2(C)$. For any metric $d$ the
weak transport inequalities $\widetilde T_2(C)$ and $\widetilde T_2^{(i)}(C)$
have dual forms given below in (3.3) and (3.4). These expressions are
particularly explicit when $d$ is the Hamming distance.

Proof. Let $C_b$ denote the set of all continuous bounded functions. From the
dual form of $\widetilde W_\alpha$ for $\alpha \in M^+(E)$ fixed we have

\[ \widetilde W_\alpha(P,Q) = \inf_{\pi} \pi[\alpha(Y) d(X,Y)] = \sup_{f \in C_b} Q[f_\alpha] - P[f] \]

where $f_\alpha(y) = \inf_x \{\alpha(y) d(x,y) + f(x)\}$. Then a measure $P$
satisfies $\widetilde T_2(C)$ if for any $\alpha \in M^+(E)$ and any
probability measure $Q$

\[ \sup_{f \in C_b} Q[f_\alpha] - P[f] \le \sqrt{2C\,Q[\alpha^2]\,\mathcal K(P|Q)} = \inf_{\lambda > 0} \Big\{ \frac{\lambda C\, Q[\alpha^2]}{2} + \frac{\mathcal K(P|Q)}{\lambda} \Big\}. \]

Thus $P$ satisfies $\widetilde T_2(C)$ if for any measure $Q$ it holds

\[ \sup_{\lambda > 0} \sup_{\alpha \ge 0} \sup_{f \in C_b} Q\big[\lambda(f_\alpha - P[f]) - (\lambda\alpha)^2 C/2\big] - \mathcal K(P|Q) \le 0. \]

By the variational form of the entropy we obtain

\[ \sup_{\lambda > 0} \sup_{\alpha \ge 0} \sup_{f \in C_b} P\big[\exp\big(\lambda(f_\alpha - P[f]) - (\lambda\alpha)^2 C/2\big)\big] \le 1. \tag{3.3} \]

In the specific case $d(x,y) = 1_{x \ne y}$ we have the explicit expression
$f_\alpha(y) = (\alpha(y) + \inf f) \wedge f(y)$. As the difference
$f_\alpha - f$ is unchanged when adding a constant to $f$, we can take
$\inf f = 0$ with no loss of generality and

\[ \sup_{\alpha \ge 0} P\big[\exp\big(\lambda(f_\alpha - P[f]) - (\lambda\alpha)^2 C/2\big)\big] = P\big[\exp\big(\lambda(f - P[f]) - \lambda^2 f^2 C/2\big)\big]. \]

But for any $X \ge 0$ we have $X - X^2/2 \le \log(1 + X)$ and thus

\[ P[\exp(X - X^2/2)] \le 1 + P[X] \le \exp(P[X]). \]

$\widetilde T_2(1)$ follows by taking $X = \lambda f$. To prove that
$\widetilde T_2^{(i)}(1)$ holds we start from its dual form. For reasons similar
to those leading to the dual form (3.3), our weak transport inequality
$\widetilde T_2^{(i)}(C)$ holds for any $C > 0$ iff

\[ \sup_{\lambda > 0} \sup_{\alpha \ge 0} \sup_{f \in C_b} P\big[\exp\big(\lambda(f_\alpha - P[f]) - P[(\lambda\alpha)^2] C/2\big)\big] \le 1. \tag{3.4} \]

Noticing that we can restrict to $\alpha(x) \le \sup f - f(x)$, taking
$\sup f = 0$ and $C = 1$ we obtain the sufficient condition

\[ \sup_{f \le 0} P\big[\exp\big(\lambda(f - P[f]) - P[(\lambda f)^2]/2\big)\big] \le 1. \]

For any non-positive r.v. $X$ we have $\exp(X) \le 1 + X + X^2/2$ and the
desired result follows.

Point (2) is proved noticing that $d(x,y) \le D\,1_{x \ne y}$. □
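The Hamming-distance bound used in point (1) can be checked numerically on finite spaces. The following sketch is a sanity check, not part of the proof: it draws random finite laws $P$ and random non-negative functions $f$ with $\inf f = 0$, and evaluates $P[\exp(\lambda(f - P[f]) - \lambda^2 f^2/2)]$, which the argument above bounds by 1 when $C = 1$. The function name `hamming_dual_lhs` and the sampling ranges are arbitrary choices made for the illustration.

```python
import math
import random


def hamming_dual_lhs(weights, f_values, lam):
    """Return P[exp(lam * (f - P[f]) - lam^2 * f^2 / 2)] for a finite law P."""
    mean_f = sum(w * f for w, f in zip(weights, f_values))
    return sum(
        w * math.exp(lam * (f - mean_f) - (lam * f) ** 2 / 2.0)
        for w, f in zip(weights, f_values)
    )


random.seed(0)
worst = 0.0
for _ in range(1000):
    size = random.randint(2, 6)
    raw = [random.random() + 1e-9 for _ in range(size)]
    total = sum(raw)
    weights = [r / total for r in raw]            # random probability vector
    f_values = [random.uniform(0.0, 5.0) for _ in range(size)]
    shift = min(f_values)
    f_values = [f - shift for f in f_values]      # enforce inf f = 0
    lam = random.uniform(0.01, 3.0)
    worst = max(worst, hamming_dual_lhs(weights, f_values, lam))
```

The largest value found, stored in `worst`, should not exceed 1 up to rounding, in line with the elementary inequality $X - X^2/2 \le \log(1 + X)$ for $X \ge 0$.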



3.3. Weak transport inequalities on $E^n$, $n \ge 2$. Let us present a new
coupling technique based on the following so-called $\gamma(p)$-weak dependence
properties of any measure $P$ on $E^n$. Add artificially time 0 and put
$X_0 = Y_0 = x_0 = y_0$ for a fixed point $y_0 \in E$. Denote
$x^{(i)} = (x_i, \dots, x_0)$ for $i \ge 0$. Recall the classical Wasserstein
distance

\[ W_p(P,Q) = \inf_{\pi \in M(P,Q)} \pi[d^p(X,Y)]^{1/p}. \]

Let us work under the following weak dependence assumption:

Definition 3.1. For any $1 \le p \le 2$ and any metric $d$, the probability
measure $P$ is $\gamma(p)$-weakly dependent if for any $0 < i < k \le n$ there
exists a coefficient $\gamma_{k,i}(p) \ge 0$ such that

\[ W_p(P_{X_k|x^{(i)}}, P_{X_k|y_i,x^{(i-1)}}) \le \gamma_{k,i}(p)\, d(x_i,y_i), \qquad \forall\, x^{(i)} \in E^{i+1},\ (x_i,y_i) \in E^2. \tag{3.5} \]

Let us denote

\[ \Gamma(p) = \begin{pmatrix}
1 & 0 & \cdots & \cdots & 0 \\
\gamma_{2,1}(p) & 1 & 0 & & \vdots \\
\gamma_{3,1}(p) & \gamma_{3,2}(p) & 1 & \ddots & \vdots \\
\vdots & \vdots & \ddots & \ddots & 0 \\
\gamma_{n,1}(p) & \gamma_{n,2}(p) & \cdots & \gamma_{n,n-1}(p) & 1
\end{pmatrix}. \]

The matrix $\Gamma(p)$ has $n$ rows and $n$ columns. We equip $\mathbb R^n$ with
the $\ell^p$ norm and the set of matrices of size $n \times n$ with the
subordinate norm, both denoted $\|\cdot\|_p$ for any $1 \le p \le \infty$.

Theorem 3.2. If $P$ is $\gamma(p)$-weakly dependent and $P_{x_j|x^{(j-1)}}$
satisfies $\widetilde T_p(C)$ or $\widetilde T_p^{(i)}(C)$ for all
$1 \le j \le n$ then $\widetilde T_p(C\|\Gamma(p)\|_p^2\, n^{2/p-1})$ or
$\widetilde T_p^{(i)}(C\|\Gamma(p)\|_p^2\, n^{2/p-1})$ holds respectively.

Remark 3.3. When the process $(X_t)$ is stationary, we have
$\gamma_{j,i}(p) = \gamma_{k,\ell}(p)$ for $j - i = k - \ell$. From the basic
inequality $\|A\|_p \le \|A\|_1^{1/p}\|A\|_\infty^{1-1/p}$ and the fact that
$\|\Gamma\|_1 = \|\Gamma\|_\infty = 1 + \sum_{i=1}^{n-1} \gamma_{i,0}(p)$, we
obtain $\|\Gamma\|_p \le 1 + \sum_{i=1}^{n-1} \gamma_{i,0}(p)$.
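The norm bound of Remark 3.3 is easy to evaluate numerically. The sketch below is an illustration only: it assumes a stationary sequence of coefficients decaying geometrically (an arbitrary choice, not taken from the paper), builds the lower-triangular matrix with ones on the diagonal, and compares the subordinate norms for $p = 1$, $p = \infty$ and $p = 2$. The helper names `gamma_matrix`, `norm_1`, `norm_inf` and `norm_2` are hypothetical, and the spectral norm is only approximated from below by power iteration.

```python
import math


def gamma_matrix(lag_coeffs, n):
    """Lower-triangular Toeplitz matrix with ones on the diagonal.

    lag_coeffs[m - 1] is the (assumed stationary) coefficient at lag m >= 1;
    lags beyond the end of the list are treated as zero.
    """
    def coeff(lag):
        if lag == 0:
            return 1.0
        return lag_coeffs[lag - 1] if lag <= len(lag_coeffs) else 0.0
    return [[coeff(k - i) if k >= i else 0.0 for i in range(n)] for k in range(n)]


def norm_1(mat):
    """Subordinate l1 norm: maximal absolute column sum."""
    return max(sum(abs(row[j]) for row in mat) for j in range(len(mat[0])))


def norm_inf(mat):
    """Subordinate l-infinity norm: maximal absolute row sum."""
    return max(sum(abs(v) for v in row) for row in mat)


def norm_2(mat, iters=200):
    """Approximate spectral norm by power iteration on mat^T mat."""
    n = len(mat)
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iters):
        w = [sum(mat[k][j] * v[j] for j in range(n)) for k in range(n)]  # mat v
        u = [sum(mat[k][j] * w[k] for k in range(n)) for j in range(n)]  # mat^T w
        scale = math.sqrt(sum(x * x for x in u))
        v = [x / scale for x in u]
    w = [sum(mat[k][j] * v[j] for j in range(n)) for k in range(n)]
    return math.sqrt(sum(x * x for x in w))


n = 8
gamma = gamma_matrix([0.5 ** m for m in range(1, n)], n)
# Interpolation bound with p = 2: ||G||_2 <= sqrt(||G||_1 * ||G||_inf).
bound = math.sqrt(norm_1(gamma) * norm_inf(gamma))
```

With these coefficients both `norm_1(gamma)` and `norm_inf(gamma)` equal one plus the sum of the lag coefficients, and `norm_2(gamma)` stays below `bound`, so the constant in Theorem 3.2 remains bounded in $n$ when $p = 2$ and the coefficients are summable.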

Remark 3.4. In the case $p = 1$ we recover the Hoeffding inequality of Djellout
et al. [12] from the dual form of
$\widetilde T_1(C) = \widetilde T_1^{(i)}(C) = T_1(C)$. Recall the assumption
$(C1)'$ of [12]: for any 1-Lipschitz function $f$ it holds

\[ \big|P[f(X_{k+1}, \dots, X_n)|x^{(k)}] - P[f(X_{k+1}, \dots, X_n)|y_k, x^{(k-1)}]\big| \le S\, d(x_k,y_k). \]
For any $a = (a_{k+1}, \dots, a_n)'$, as $f = \sum_{j=k+1}^n a_j f_j$ is a
1-Lipschitz function whenever $\|a\|_\infty \le 1$ and the $f_j$ are
1-Lipschitz functions, we obtain $a'W_k \le S$ with
$W_k = \big(W_1(P_{X_j|x^{(k)}}, P_{X_j|y_k,x^{(k-1)}})/d(x_k,y_k)\big)_{k < j \le n}$.
Denoting $W$ the $n \times n$ matrix of the $W_k$ completed with 0 we obtain
$\|a'W\|_\infty \le S$ for all $\|a\|_\infty \le 1$. By the definition of the
matrix norm and by duality it is equivalent to $\|W'\|_1 \le S$. Finally, one
can always choose $\Gamma$ such that it coincides with the supremum of $W$ over
all $(x_k,y_k)$ such that $d(x_k,y_k) \ne 0$ and thus $(C1)'$ is equivalent to
$\|\Gamma\|_1 \le S$.

Remark 3.5. In the case where $d$ is the Hamming distance $1_{x \ne y}$, by the
Kantorovich-Rubinstein duality, for any $1 \le p \le 2$ and any
$x^{(i)} \in E^{i+1}$ and $y_i \in E$:

\[ \inf_{\pi \in M(P_{X_k|x^{(i)}},\, P_{X_k|y_i,x^{(i-1)}})} \pi[d^p(X,Y)] \le \sup_{x^{(i)},\,y_i}\ \sup_{A \in \mathcal B} \big|P(X_k \in A \mid X^{(i)} = x^{(i)}) - P(X_k \in A \mid X_i = y_i, X^{(i-1)} = x^{(i-1)})\big|. \]

Here $\mathcal B$ is the Borel $\sigma$-algebra and the supremum in
$x^{(i)}, y_i$ is taken almost everywhere. Following the notation of Samson
[32] for $p = 2$ and Kontorovich and Ramanan [30] for $p = 1$, let us define for
any $1 \le p \le 2$ the $\tilde\gamma(p)$-weak dependence coefficients as

\[ \tilde\gamma_{k,i}(p) = \sup_{x^{(i)},\,y_i}\ \sup_{A \in \mathcal B} \big|P(X_k \in A \mid X^{(i)} = x^{(i)}) - P(X_k \in A \mid X_i = y_i, X^{(i-1)} = x^{(i-1)})\big|^{1/p}. \tag{3.6} \]

The probability measure $P$ is said to be $\tilde\gamma(p)$-weakly dependent
when its coefficients are finite. For Markov chains, condition (3.6) is
equivalent to uniform ergodicity. By definition,
$\tilde\gamma_{k,i}^p \le 2\varphi_{k-i}$ where $\varphi$ is the uniform mixing
coefficient introduced by Ibragimov [18]. For $p = 2$, we obtain a weakened form
of the transport inequality obtained by Samson [32] as

\[ \widetilde W_2(P,Q) \le \inf_{\pi} \sup_{\alpha} \frac{\sum_{j=1}^n \pi[\alpha_j(Y) d(X_j,Y_j)]}{\big(\sum_{j=1}^n Q[\alpha_j(Y)^2]\big)^{1/2}} = \inf_{\pi} \Big(\sum_{j=1}^n Q\big[\pi[X_j \ne Y_j \mid Y_j]^2\big]\Big)^{1/2}. \]

However, the dual form of our weak transport inequality yields the same
exponential inequalities as those obtained in [35]. Notice however that our
notion of transport seems too weak to yield concentration properties in terms of
the convex distance, as is done by Talagrand in [33] or by Marton in [25].
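For the Hamming distance, the optimal transport cost between two laws is their total variation distance, attained by a maximal coupling. The sketch below illustrates this on a finite space; it is a standard construction written under the assumption that laws are stored as dictionaries, and the names `total_variation` and `maximal_coupling` and the example masses are hypothetical.

```python
def total_variation(p, q):
    """sup_A |p(A) - q(A)| for two finite laws given as {point: mass}."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(s, 0.0) - q.get(s, 0.0)) for s in support)


def maximal_coupling(p, q):
    """Return a coupling {(x, y): mass} of p and q minimising P(X != Y)."""
    support = sorted(set(p) | set(q))
    # Common mass stays on the diagonal {x = y}.
    overlap = {s: min(p.get(s, 0.0), q.get(s, 0.0)) for s in support}
    coupling = {(s, s): m for s, m in overlap.items() if m > 0}
    # Remaining masses have disjoint supports and are coupled independently.
    excess_p = {s: p.get(s, 0.0) - overlap[s] for s in support}
    excess_q = {s: q.get(s, 0.0) - overlap[s] for s in support}
    off_diagonal_mass = sum(excess_p.values())
    if off_diagonal_mass > 0:
        for x, ex in excess_p.items():
            for y, ey in excess_q.items():
                if ex > 0 and ey > 0:
                    coupling[(x, y)] = ex * ey / off_diagonal_mass
    return coupling


p = {"a": 0.5, "b": 0.3, "c": 0.2}
q = {"a": 0.2, "b": 0.3, "c": 0.5}
coupling = maximal_coupling(p, q)
# Hamming transport cost of the coupling: pi[1_{X != Y}].
hamming_cost = sum(m for (x, y), m in coupling.items() if x != y)
```

Here `hamming_cost` coincides with `total_variation(p, q)` up to rounding, which is the identity behind the definition (3.6) of the coefficients in the Hamming case.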

Proof. The proofs of the two assertions are similar as the weak dependence
condition (3.5) is symmetric in $x_i$ and $y_i$. Thus the proof of the second
assertion is omitted.

Let us fix $\alpha \in M^+(E^n)$ such that $Q[\alpha_j^q] < \infty$ for all
$1 \le j \le n$. As preliminaries, we recall the following result on the
existence of the optimal Markov coupling due to Rüschendorf [31] and a simple
and useful inequality deduced from this result, stated in Lemma 3.4.

Let $\sigma : E^n \times E^n \to \mathbb R_+$ and define the section of
$\sigma$ in $(x_1,y_1) \in E^2$ as

\[ \sigma_{x_1,y_1}(x_2,y_2) = \sigma((x_1,x_2),(y_1,y_2)). \]

Theorem 3.3 (Theorem 3 in [31]). We have the equivalence between (1) and (2),
where (2) asserts both (a) and (b):

(1) $\inf_{\pi \in M(P,Q)} \pi[\sigma] = \pi^*[\sigma]$ with $\pi^* \in M(P,Q)$,

(2) (a) $h(x,y) := \inf_{\pi_{2|1}} \pi_{2|1}[\sigma_{x,y}] = \pi^*_{2|1}[\sigma_{x,y}|(x,y)]$ is finite $\pi_1$-a.s. and
    (b) $\inf_{\pi_1} \pi_1[h] = \pi_1^*[h] < \infty$.

A simple corollary of this Theorem is the following result: 

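Lemma 3.4. Let $P, Q \in M(E^n)$ be decomposed as $P = P_1 P_{|x_1}$ and
$Q = Q_1 Q_{|y_1}$ for $P_1, Q_1 \in M(E)$ and
$P_{|x_1}, Q_{|y_1} \in M(E^{n-1})$. Then for any $\alpha \in M^+(E^n)$ and any
coupling $\pi_1 \in M(P_1,Q_1)$ we have

\[ \widetilde W_\alpha(P,Q) \le \pi_1\big[Q_{|Y_1}[\alpha_1|Y_1]\, d(X_1,Y_1) + \widetilde W_{\alpha^{(1)}}(P_{|X_1}, Q_{|Y_1})\big]. \tag{3.7} \]

Proof. Let us assume that for almost all $x_1, y_1 \in E$ we have
$\widetilde W_{\alpha^{(1)}}(P_{|x_1}, Q_{|y_1}) < \infty$. Then, by lower
semi-continuity, there exists $\pi^*_{|x_1,y_1}$ such that:

\[ \pi^*_{|x_1,y_1}\Big[\sum_{j=2}^n \alpha_j(y_1, Y_2, \dots, Y_n)\, d(X_j,Y_j)\Big] = \widetilde W_{\alpha^{(1)}}(P_{|x_1}, Q_{|y_1}). \]

Thus the desired result follows from Theorem 3.3 remarking that for any
$x_1, y_1 \in E$ we have

\[ \pi^*_{|x_1,y_1}[\alpha_1(Y)] = Q_{|y_1}[\alpha_1|y_1] \]

by definition of Markov couplings. □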

Let us consider now the following coupling scheme denoted ft defined recursively 
as 7r = 7r„|„_i ■ • • 7T2|i^"i|o € M(E n ) where 7Tj\j-i — n x . <v -\ x V-±) , y ti-i) ^ s determined 
such that 

n 

(3.8) ttj-u-i [£Q|r„ v o-o y w - 1) ] 1/ «7*,id(^. 



k= 

for all x^" 1 ),^- 1 ) in E l ~ x . 

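We are now ready to prove the result, iterating the same reasoning several
times. Let us detail the case $j = 1$ when considering probabilities
conditionally on $y_0$. Applying (3.7) and (2.8) we have

\[ \widetilde W_\alpha(P,Q) \le \pi_{1|y^{(0)}}\Big[ Q_{|Y_1,y^{(0)}}[\alpha_1|Y_1,y^{(0)}]\, d(X_1,Y_1) + \widetilde W_{\alpha^{(1)}}(P_{|Y_1,y^{(0)}}, Q_{|Y_1,y^{(0)}}) + \widetilde W_{\tilde\alpha^{(1)}}(P_{|X_1,y^{(0)}}, P_{|Y_1,y^{(0)}}) \Big]. \tag{3.9} \]

To bound the last term, we use the definition of the $\gamma(p)$-weak
dependence:

Lemma 3.5. For any $\alpha_k \in M^+(E)$ for all $j < k \le n$ and any
$\gamma(p)$-weakly dependent probability measure $P$ we have

\[ \widetilde W_{\alpha^{(j)}}(P_{|x_j,y^{(j-1)}}, P_{|y^{(j)}}) \le \sum_{k=j+1}^n Q_{|y^{(j)}}[\alpha_k^q|y^{(j)}]^{1/q}\, \gamma_{k,j}\, d(x_j,y_j). \tag{3.10} \]

Proof. Assume that $Q[\alpha_k^q] < \infty$ for $j < k \le n$. Then, applying
the Hölder inequality and the definition of Markov couplings, we have

\[ \widetilde W_{\alpha^{(j)}}(P_{|x_j,y^{(j-1)}}, P_{|y^{(j)}}) = \inf_{\pi_{|j}} \pi_{|j}\Big[\sum_{k=j+1}^n \alpha_k\, d(X_k,Y_k)\Big] \]

\[ \le \inf_{\pi_{|j}} \sum_{k=j+1}^n Q_{|y^{(j)}}[\alpha_k^q|y^{(j)}]^{1/q}\, \pi_{|j}[d^p(X_k,Y_k)]^{1/p} \]

\[ \le \sum_{k=j+1}^n Q_{|y^{(j)}}[\alpha_k^q|y^{(j)}]^{1/q}\, \inf_{\pi_{|j}} \pi_{|j}[d^p(X_k,Y_k)]^{1/p} \]

\[ \le \sum_{k=j+1}^n Q_{|y^{(j)}}[\alpha_k^q|y^{(j)}]^{1/q}\, W_p(P_{X_k|x_j,y^{(j-1)}}, P_{X_k|y^{(j)}}) \]

and the result follows by definition of the $\gamma(p)$-weak dependence
coefficients. □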
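Collecting the bounds (3.9) and (3.10) we obtain

\[ \widetilde W_\alpha(P,Q) \le \pi_{1|y^{(0)}}\Big[ \sum_{k=1}^n Q_{|Y_1,y^{(0)}}[\alpha_k^q|Y_1,y^{(0)}]^{1/q}\, \gamma_{k,1}\, d(X_1,Y_1) + \widetilde W_{\alpha^{(1)}}(P_{|Y_1,y^{(0)}}, Q_{|Y_1,y^{(0)}}) \Big]. \]

Let us apply the same reasoning as above, for any $1 \le i \le n$ and
conditionally on $y^{(i-1)}$, to
$\widetilde W_{\alpha^{(i-1)}}(P_{|y^{(i-1)}}, Q_{|y^{(i-1)}})$, where
$\alpha_j^{(i)}$ denotes the section of $\alpha_j$ in $y^{(i)}$, i.e. the map
$(y_{i+1}, \dots, y_n) \mapsto \alpha_j(y)$, and
$\alpha^{(i)} = (\alpha_j^{(i)})_{j > i}$. For any $1 \le j \le n$, we obtain:

\[ \widetilde W_{\alpha^{(j-1)}}(P_{|y^{(j-1)}}, Q_{|y^{(j-1)}}) \le \pi_{j|y^{(j-1)}}\Big[ \sum_{k=j}^n Q_{|Y_j,y^{(j-1)}}[\alpha_k^q|Y_j,y^{(j-1)}]^{1/q}\, \gamma_{k,j}\, d(X_j,Y_j) + \widetilde W_{\alpha^{(j)}}(P_{|Y_j,y^{(j-1)}}, Q_{|Y_j,y^{(j-1)}}) \Big]. \]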

For the specific Markov coupling we consider, the identity (|3.8[) holds and 

W^o-d (-P| y w-i) . Q B 0-i) ) < (E Q\y<i-^ [ a fcly (i_1) ]7fe,j) ^(-P^ | a o-i) , Q % |j,o/-D ) 

k=j 

+ Kj\ y u-i) [W a (j) (-P|y ji2/ U-i), ,2,0-1))] 
n 

< E Qly"- 1 ' Kb (j ' _1) ] 1/9 7^ W^iytf-D . Owlvu-y ) 
fc=J 

0) 

+ TTjIj/O-i) [W a (P|y jjl( W-i), Q|y 3 , y O-i))] 

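where the last inequality follows from the concavity of $x \mapsto x^{1/q}$ and
Jensen's inequality. Applying an inductive argument, we obtain

\[ \widetilde W_\alpha(P,Q) \le \tilde\pi\Big[\sum_{j=1}^n \sum_{k=j}^n Q_{|Y^{(j-1)}}[\alpha_k^q|Y^{(j-1)}]^{1/q}\, \gamma_{k,j}\, \widetilde W_p(P_{X_j|Y^{(j-1)}}, Q_{Y_j|Y^{(j-1)}})\Big] \]

\[ \le \sum_{j=1}^n \sum_{k=j}^n Q[\alpha_k^q]^{1/q}\, \gamma_{k,j}\, Q\big[\widetilde W_p(P_{X_j|Y^{(j-1)}}, Q_{Y_j|Y^{(j-1)}})^p\big]^{1/p} \]

\[ \le \sum_{j=1}^n \sum_{k=j}^n Q[\alpha_k^q]^{1/q}\, \gamma_{k,j}\, Q\big[\big(2C\,\mathcal K(P_{X_j|Y^{(j-1)}}|Q_{Y_j|Y^{(j-1)}})\big)^{p/2}\big]^{1/p}, \]

where the second inequality follows from Hölder's and Jensen's inequalities and
the last one from the assumption that $P_{x_j|y^{(j-1)}}$ satisfies
$\widetilde T_p(C)$. Let us denote $\mathbf Q$ the row vector
$(Q[\alpha_k^q]^{1/q})_{1 \le k \le n}$ and $\mathbf W$ the column vector
$\big(Q[(2C\,\mathcal K(P_{X_j|Y^{(j-1)}}|Q_{Y_j|Y^{(j-1)}}))^{p/2}]^{1/p}\big)_{1 \le j \le n}$.
With $\langle \cdot\,; \cdot \rangle$ denoting the scalar product, we obtain

\[ \widetilde W_\alpha(P,Q) \le \langle \mathbf Q; \Gamma \mathbf W \rangle \le \|\mathbf Q\|_q\, \|\Gamma\|_p\, \|\mathbf W\|_p. \]

Notice that we have the identities

\[ \|\mathbf Q\|_q = \Big(\sum_{k=1}^n Q[\alpha_k^q]\Big)^{1/q}, \qquad \|\mathbf W\|_p = \Big(\sum_{j=1}^n Q\big[\big(2C\,\mathcal K(P_{X_j|Y^{(j-1)}}|Q_{Y_j|Y^{(j-1)}})\big)^{p/2}\big]\Big)^{1/p}. \]

Moreover, noticing that $p/2 \le 1$, successive applications of Jensen's
inequality and Hölder's inequality yield

\[ \|\mathbf W\|_p \le \Big(\sum_{j=1}^n Q\big[2C\,\mathcal K(P_{X_j|Y^{(j-1)}}|Q_{Y_j|Y^{(j-1)}})\big]^{p/2}\Big)^{1/p} \]

\[ \le n^{1/p - 1/2} \Big(\sum_{j=1}^n Q\big[2C\,\mathcal K(P_{X_j|Y^{(j-1)}}|Q_{Y_j|Y^{(j-1)}})\big]\Big)^{1/2} \]

\[ \le \sqrt{n^{2/p-1}\, 2C\,\mathcal K(P|Q)}. \]

Finally, we obtain

\[ \frac{\widetilde W_\alpha(P,Q)}{\big(\sum_{j=1}^n Q[\alpha_j^q]\big)^{1/q}} \le \sqrt{2C\,\|\Gamma\|_p^2\, n^{2/p-1}\, \mathcal K(P|Q)}. \]

The desired result follows by definition of the weak transport cost by taking
the supremum over all $\alpha \in M^+(E^n)$. □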



4. Examples of $\gamma(p)$-weakly dependent processes

We have already noticed that when $d$ is chosen as the Hamming distance, the
$\gamma(p)$-weak dependence is, for example, satisfied for $\Phi$-mixing processes with
$\|\Gamma(p)\|_p \le 1 + \sum_{r=1}^{n}(2\Phi_r)^{1/p}$ for any $1 \le p \le 2$, see [32]. The $\gamma(p)$-weak
dependence is also satisfied for non-stationary sequences, see [20].

For $E$ a real vector space, the choice of the Hamming distance is not
natural and the resulting weak dependence conditions are often too restrictive. In
what follows, we focus on the more natural choice $d = \|\cdot\|$, the Euclidean norm.
We will extensively use the fact that probability measures satisfying weak transport
inequalities admit finite moments of any order for $d(x_0, X)$, $\forall x_0 \in E$ (it also implies
exponential moments, see the next Section).
4.1. Linear models.

Example 4.1 (AR($\infty$) models). Consider $(X_t)$ the stationary solution of
the autoregressive equation

\[
X_t = \sum_{i\ge 1} a_i X_{t-i} + \xi_t
\]

where the real numbers $a_i$ are such that $1 - \sum_{i\ge 1} a_i z^i$ has no root
outside the unit circle. Then the weak dependence condition (3.5) is satisfied with
$\gamma_{k,j}(p) = |a_{k-j}|$ for any $1 \le p \le 2$ and any $0 < j < k \le n$.
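To make the coefficients of Example 4.1 concrete, the sketch below builds the matrix of weak dependence coefficients for a finite list of AR coefficients and estimates its operator norm for $p = 2$. Several points are assumptions made for illustration: the matrix is taken lower triangular with ones on the diagonal, the helper names `gamma_matrix` and `operator_norm_2` are hypothetical, and the coefficients $a_i = 2^{-i}$ are arbitrary. The comparison value $1 + \sum_i |a_i|$ is the elementary bound obtained from the row and column sums of a nonnegative matrix.

```python
def gamma_matrix(a, n):
    """n x n lower-triangular matrix with 1 on the diagonal (an assumption)
    and |a_{k-j}| at position (k, j) for j < k; a[0] stands for a_1."""
    rows = []
    for k in range(n):
        row = []
        for j in range(n):
            if j == k:
                row.append(1.0)
            elif j < k and k - j <= len(a):
                row.append(abs(a[k - j - 1]))
            else:
                row.append(0.0)
        rows.append(row)
    return rows


def operator_norm_2(m, iters=500):
    """Largest singular value of m via power iteration on m^T m."""
    n = len(m)
    v = [n ** -0.5] * n  # unit start vector with positive entries
    sigma = 0.0
    for _ in range(iters):
        w = [sum(m[i][j] * v[j] for j in range(n)) for i in range(n)]  # w = m v
        u = [sum(m[i][j] * w[i] for i in range(n)) for j in range(n)]  # u = m^T w
        norm_u = sum(x * x for x in u) ** 0.5
        sigma = norm_u ** 0.5  # ||m^T m v|| approximates sigma_max^2 for unit v
        v = [x / norm_u for x in u]
    return sigma


if __name__ == "__main__":
    a = [0.5 ** i for i in range(1, 11)]  # hypothetical coefficients a_1..a_10
    gamma = gamma_matrix(a, 20)
    print(operator_norm_2(gamma), 1.0 + sum(abs(x) for x in a))
```

Summable coefficients therefore give a norm that stays bounded as $n$ grows, which is what keeps the constant of the weak transport inequality dimension free when $p = 2$.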

Example 4.2 (MA($\infty$) models). Let us consider $X_t = \sum_{i=1}^{\infty} a_i \xi_{t-i}$ with real numbers
$a_i$ such that

\[
\sum_{i=1}^{\infty} |a_i| < \infty.
\]

Then the model is well defined and if it is invertible, i.e. $\sum_{i\ge 1} a_i z^i$ has no root
outside the unit circle, then it admits an AR($\infty$) representation and the weak
dependence condition (|3.5p holds with Jk,j{p) < | J2j<k X)*i+-+i-=A ITi=i a it\ for 
any $1 \le p \le 2$ and any $0 < j < k \le n$.

Example 4.3 (ARMA models). Let us consider the ARMA model

\[
X_0(x) = x, \qquad X_{t+1}(x) = A X_t(x) + \xi_{t+1}
\]

in $E = \mathbb{R}^d$ where $A \in M_{d,d}$ (the space of $d \times d$ matrices) and $(\xi_t)$ is a sequence
of i.i.d. random vectors in $\mathbb{R}^d$ called the innovations. This model is a particular
case of the general model (4.1) below with $\psi_t(x) = Ax + \xi_t$. The $\gamma(p)$-weak dependence
condition is equivalent to

\[
\rho_{sp}(A) := \max\{|\lambda| ;\ \lambda \text{ is an eigenvalue in } \mathbb{C} \text{ of } A\} < 1,
\]

which is the necessary and sufficient condition for the ergodicity of this linear
ARMA model $(X_t)$.
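The spectral radius condition of Example 4.3 can be checked numerically without computing eigenvalues. The sketch below is one possible way to do it, not a method taken from the paper: it uses Gelfand's formula $\rho_{sp}(A) = \lim_k \|A^k\|^{1/k}$ with repeated squaring and the Frobenius norm. The function names and the example matrices are assumptions chosen for illustration.

```python
import math


def matmul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def frobenius(a):
    """Frobenius norm of a matrix."""
    return math.sqrt(sum(x * x for row in a for x in row))


def spectral_radius(a, squarings=30):
    """Approximate rho(A) by ||A^k||^(1/k) with k = 2**squarings.

    The matrix is renormalised after each squaring and the logarithm of the
    norm is tracked separately, which avoids overflow and underflow.
    """
    f = frobenius(a)
    if f == 0.0:
        return 0.0
    m = [[x / f for x in row] for row in a]
    log_s, k = math.log(f), 1
    for _ in range(squarings):
        m = matmul(m, m)
        g = frobenius(m)
        if g == 0.0:  # some power of A vanishes: A is nilpotent
            return 0.0
        m = [[x / g for x in row] for row in m]
        log_s = 2.0 * log_s + math.log(g)
        k *= 2
    return math.exp(log_s / k)


if __name__ == "__main__":
    # Hypothetical matrix: its norm exceeds 1 but its spectral radius is 0.5.
    A = [[0.5, 1.0], [0.0, 0.5]]
    print(spectral_radius(A) < 1.0)
```

The example matrix shows why the condition is stated on the spectral radius rather than on a matrix norm: a single step of the recursion may expand distances even though the model is ergodic.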

4.2. Non-linear models. 

Example 4.4 (Stochastic Recurrent Equation (SRE)). Consider the SRE (also
called Iterated Random Functions)

\[
X_0(x) := x \in E, \qquad X_{t+1}(x) = \psi_{t+1}(X_t(x)), \quad t \ge 0, \tag{4.1}
\]

where $(\psi_t)$ is a sequence of i.i.d. random maps. Let us denote also by $P$ the probability
of the whole process $(X_t)_{t\ge 0}$. For any $1 \le p \le 2$, if the distribution of $\psi_1(x)$ belongs
to $\tilde T_p(C)$ or $\tilde T_p^{(i)}(C)$ for any $x \in E$ and there exists some $S > 0$ satisfying

\[
\sum_{t=1}^{\infty} P\big[d(X_t(x), X_t(x'))^p\big]^{1/p} \le S\, d(x,x') \qquad \forall x,x' \in E, \tag{4.2}
\]

then $P \in \tilde T_p(C(1+S)^2 n^{2/p-1})$ or $\tilde T_p^{(i)}(C(1+S)^2 n^{2/p-1})$.

Example 4.5 (General affine processes). Consider now the specific SRE

\[
X_0(x) = x, \qquad X_{t+1}(x) = f(X_t(x)) + M(X_t(x))\,\xi_{t+1},
\]

where $E = \mathbb{R}^d$, $\xi_t \in \mathbb{R}^{d'}$, $f : \mathbb{R}^d \to \mathbb{R}^d$, $M : \mathbb{R}^d \to M_{d,d'}$ (the space of $d \times d'$
matrices) and the noise $(\xi_t)$ is a sequence of i.i.d. random vectors of $\mathbb{R}^{d'}$ such that
its distribution $P_\xi$ is centered. Fix $p = 2$ and assume that:

(1) $P_\xi \in \tilde T_2(C)$ or $\tilde T_2^{(i)}(C)$ on $\mathbb{R}^{d'}$ w.r.t. the Euclidean metric;

(2) there exists $K > 0$ such that $\|M(x)\|_2 \le K$, $\forall x \in \mathbb{R}^d$;

(3) the Lyapunov exponent in $L^2$ satisfies

\[
\lambda_{\max}(L^2) := \lim_{t\to\infty}\Big(\sup_{x\ne y}\frac{P\big[|X_t(x) - X_t(y)|^2\big]}{|x-y|^2}\Big)^{1/t} < 1.
\]

Using a version of Lemma 2.1 in [12] we obtain that conditions (1) and (2) imply
that $P_{X_t|X_{t-1}} \in \tilde T_2(CK^2)$ or $\tilde T_2^{(i)}(CK^2)$. Moreover condition (4.2) is satisfied with
$S = (1 - \lambda_{\max}(L^2))^{-1/2}$ and thus $P \in \tilde T_2\big(CK^2(1 + (1 - \lambda_{\max}(L^2))^{-1/2})^2\big)$ or
$\tilde T_2^{(i)}\big(CK^2(1 + (1 - \lambda_{\max}(L^2))^{-1/2})^2\big)$. This answers positively a question raised in
Remark 3.6 in [12]. However, notice that the condition $\lambda_{\max}(L^2) < 1$ can be difficult to
check on specific models. One possible sufficient condition is the Lipschitz mixing
condition of Duflo [14] asserting the existence of $K > 0$ and $0 < r < 1$ such that

\[
P\big[|X_t(x) - X_t(y)|^2\big] \le K r^t |x-y|^2, \qquad \forall x, y \in E.
\]
Another possibility is given in the next example. 

Example 4.6 (Iterated Random Lipschitz Maps). Consider the general SRE (4.1)
and assume that the random maps $\psi_t$ are Lipschitz-continuous. Denote the Lipschitz
coefficient of any function $f$ by

\[
\Lambda(f) := \sup_{x \ne y}\frac{d(f(x), f(y))}{d(x,y)}.
\]

Let $P_\psi$ be the distribution of the sequence of i.i.d. random maps $(\psi_t)$. The top Lyapunov
exponent $\lambda_*$ is defined as $\lim_{t\to\infty} t^{-1}\log\Lambda(\psi_0\circ\psi_{-1}\circ\cdots\circ\psi_{-t+1})$. Its existence
in $\mathbb{R}\cup\{-\infty\}$ is due to the subadditive ergodic theorem. The condition $\lambda_* < 0$ is
sufficient for the existence of the stationary solution of the SRE. It implies that $\lambda_{\max}(L^2) < 1$
when $\psi_1(x)$ belongs to $\tilde T_p(C)$ or $\tilde T_p^{(i)}(C)$ for any $x \in E$.
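The definition of the top Lyapunov exponent can be illustrated on scalar affine maps $\psi_t(x) = a_t x + b_t$, for which the Lipschitz coefficient of a composition is the product of the $|a_s|$, so that the exponent reduces to the mean of $\log|a_1|$. The sketch below is a toy illustration only: the two-point law of the slopes and the function names are assumptions, and the example does not check the transport condition required of $\psi_1(x)$ in Example 4.6.

```python
import math
import random


def top_lyapunov_scalar(slopes, probs):
    """Exact value E[log|a_1|] for maps psi_t(x) = a_t * x + b_t."""
    return sum(p * math.log(abs(a)) for a, p in zip(slopes, probs))


def empirical_exponent(slopes, probs, t, rng):
    """t^{-1} log Lambda(psi_0 o ... o psi_{-t+1}) for scalar affine maps.

    The Lipschitz coefficient of the composition is the product of the
    |a_s|, so its logarithm is a sum of log|a_s| over t i.i.d. draws.
    """
    draws = rng.choices(slopes, weights=probs, k=t)
    return sum(math.log(abs(a)) for a in draws) / t


if __name__ == "__main__":
    slopes, probs = [0.5, 1.5], [0.5, 0.5]  # hypothetical law of the slopes
    print(top_lyapunov_scalar(slopes, probs))
    print(empirical_exponent(slopes, probs, 20000, random.Random(0)))
```

In this toy case the exponent is negative although some of the maps expand distances, which is the typical situation where contraction holds only on average.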

Example 4.7 (Chains with Infinite Memory). Let us consider now the case of
Chains with Infinite Memory introduced by Doukhan and Wintenberger [13]:

\[
X_t = F(X_{t-1}, X_{t-2}, \ldots; \xi_t), \qquad \forall t \in \mathbb{Z}.
\]

This model does not exhibit any Markov property. Assume there exists a sequence
of non negative numbers $(a_i)$ such that

\[
P\big[d\big(F(x_1,x_2,\ldots;\xi_0), F(y_1,y_2,\ldots;\xi_0)\big)^p\big]^{1/p} \le \sum_{i\ge 1} a_i\, d(x_i, y_i).
\]

If $\sum_{i\ge 1} a_i < 1$ and the distribution of $F(x_1,x_2,\ldots;\xi_0)$ is in $\tilde T_p(C)$ or $\tilde T_p^{(i)}(C)$, the stationary measure
exists and (3.5) holds with $\gamma_{k,j}(p) \le a_{k-j}$ for any $0 < j < k \le n$.

5. New exponential inequalities

5.1. General exponential inequalities. Let $X = (X_1, \ldots, X_n)$ be distributed as
$P$ and consider a function $f : E^n \to \mathbb{R}$ such that there exist auxiliary functions
$L_j : E^n \to \mathbb{R}_+$, $1 \le j \le n$, satisfying

\[
f(y) - f(x) \le \sum_{j=1}^{n} L_j(y)\, d(x_j, y_j) \qquad \forall x, y \in E^n. \tag{5.1}
\]

Let us also consider a function $g : E^n \to \mathbb{R}$ such that there exist auxiliary functions
$L_j^{(i)} : E^n \to \mathbb{R}_+$, $1 \le j \le n$, such that

\[
g(y) - g(x) \le \sum_{j=1}^{n} L_j^{(i)}(x)\, d(x_j, y_j) \qquad \forall x, y \in E^n. \tag{5.2}
\]

The dual form of the weak transport inequalities implies the following new
exponential inequality:
Theorem 5.1. If $P$ satisfies $\tilde T_p(C)$ and $f$ satisfies (5.1) then for all $\lambda > 0$ we have

\[
P\Big[\exp\Big(\lambda(f - P[f]) - \frac{C\lambda^2}{2}\Big(\Big(2 - \frac{2}{p}\Big)\sum_{j=1}^{n} L_j^q + \Big(\frac{2}{p} - 1\Big)\Big)\Big)\Big] \le 1. \tag{5.3}
\]



If $P$ satisfies $\tilde T_p^{(i)}(C)$ and $g$ satisfies (5.2) then for all $\lambda > 0$ we have



(5.4) p[exp(x(g-P\g}) 



' 3 = 1 



(i)p-l 



- 1 



< 1. 



Remark 5.1. Consider the case of a $\gamma(2)$-weakly dependent sequence supported by
$[0,1]^n$. Then $P$ satisfies $\tilde T_2(\|\Gamma(2)\|_2^2)$ and $\tilde T_2^{(i)}(\|\Gamma(2)\|_2^2)$ by Theorem 3.2 and the
above results apply with $C = \|\Gamma(2)\|_2^2$.

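Proof. The proofs of (5.3) and (5.4) are similar. We only detail the first one.
Integrating (5.1) in $(x, y)$ with respect to a coupling $\pi$ with marginals $P$ and $Q$ we get

\[
Q[f] - P[f] \le \pi\Big[\sum_{j=1}^{n} L_j(Y)\, d(X_j, Y_j)\Big]
\]

and by definition of $\tilde W_p$ we obtain

\[
Q[f] - P[f] \le Q\Big[\sum_{j=1}^{n} L_j^q\Big]^{1/q}\,\tilde W_p(P,Q).
\]

Using that $P \in \tilde T_p(C)$ we obtain

\[
Q[f - P[f]] \le Q\Big[\sum_{j=1}^{n} L_j^q\Big]^{1/q}\sqrt{2C\,\mathcal{K}(P|Q)}. \tag{5.5}
\]

From the variational identity

\[
ab = \inf_{\lambda > 0}\Big\{\lambda\frac{a^q}{q} + \frac{b^p}{\lambda^{p-1}p}\Big\}
\]

we get for all $\lambda > 0$:

\[
Q[f - P[f]] \le \frac{\lambda C}{q}\,Q\Big[\sum_{j=1}^{n} L_j^q\Big] + \mathcal{K}(P|Q)^{p/2}\,\frac{2^{p/2}C^{1-p/2}}{\lambda^{p-1}p}.
\]

We can rewrite it as

\[
\frac{p}{2}\,Q\Big[\Big(\frac{p}{C}\Big)^{1-p/2}\lambda^{p-1}\Big(f - P[f] - \frac{\lambda C}{q}\sum_{j=1}^{n} L_j^q\Big)\Big]^{2/p} \le \mathcal{K}(P|Q).
\]

From the Young inequality

\[
\frac{p}{2}\,x^{2/p} \ge yx - \Big(1 - \frac{p}{2}\Big)y^{2/(2-p)}
\]

applied with $y = (C\lambda^2/p)^{2/p-1}$ we obtain

\[
\frac{p}{2}\Big(\Big(\frac{p}{C}\Big)^{1-p/2}\lambda^{p-2}\Big)^{2/p}x^{2/p} \ge x - \Big(1 - \frac{p}{2}\Big)\frac{C\lambda^2}{p}.
\]

For $x = Q\big[\lambda\big(f - P[f] - \frac{\lambda C}{q}\sum_{j=1}^{n} L_j^q\big)\big]$ we obtain

\[
Q\Big[\lambda\Big(f - P[f] - \frac{\lambda C}{q}\sum_{j=1}^{n} L_j^q\Big)\Big] - \mathcal{K}(P|Q) \le \Big(\frac{1}{p} - \frac{1}{2}\Big)C\lambda^2.
\]

Then the desired result follows from the variational formula of the entropy. □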
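The variational identity $ab = \inf_{\lambda>0}\{\lambda a^q/q + b^p/(\lambda^{p-1}p)\}$ used in the proof can be verified numerically. In the sketch below the closed-form minimizer $\lambda^* = b/a^{q-1}$ is obtained by setting the derivative in $\lambda$ to zero; it is a worked step added for illustration and is not stated in the text. The function names and the sample values are assumptions, and the exponent $p$ must satisfy $1 < p \le 2$ so that the conjugate exponent $q = p/(p-1)$ is finite.

```python
def young_objective(lam, a, b, p):
    """lam * a**q / q + b**p / (lam**(p-1) * p), with q conjugate to p."""
    q = p / (p - 1.0)
    return lam * a ** q / q + b ** p / (lam ** (p - 1.0) * p)


def young_minimizer(a, b, p):
    """Stationary point lam* = b / a**(q-1) of the objective (a, b > 0)."""
    q = p / (p - 1.0)
    return b / a ** (q - 1.0)


if __name__ == "__main__":
    a, b, p = 0.7, 1.3, 1.5  # illustrative values only
    lam_star = young_minimizer(a, b, p)
    print(young_objective(lam_star, a, b, p), a * b)
```

At the minimizer the two terms of the objective equal $ab/q$ and $ab/p$, which add up to $ab$ because $1/p + 1/q = 1$.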
5.2. Extensions of classical concentration inequalities to dependent cases.

A first corollary of Theorem 5.1 is an extension of the following classical inequality:
let $f$ be a separately convex Lipschitz function on $[0,1]^n$, then

\[
P(|f - P[f]| \ge t) \le 2\exp\Big(-\frac{t^2}{2L^2}\Big) \tag{5.6}
\]

where $L$ satisfies $|f(x) - f(y)| \le L\|x - y\|$ for any $x, y \in [0,1]^n$ equipped with the
Euclidean norm. This result was extended to contracting Markov chains in Marton
[25] and to $\tilde\gamma(2)$-weakly dependent processes in Samson [32] for convex Lipschitz
functions $f$. The extension to the more general $\gamma(2)$-weakly dependent context is
a straightforward corollary of Theorem 5.1. From Remark 5.1 we know that $P$
satisfies $\tilde T_2(\|\Gamma(2)\|_2^2)$ and $\tilde T_2^{(i)}(\|\Gamma(2)\|_2^2)$. Remark that with no loss of generality we
can assume that $f$ is smooth enough (see Samson [32] for a detailed proof of this
well known fact). Then for any $x, y \in [0,1]^n$, by convexity we have

\[
f(x) - f(y) \le \sum_{j=1}^{n} \partial_j f(x)(x_j - y_j) \le \sum_{j=1}^{n} |\partial_j f(x)|\,|x_j - y_j|.
\]

Thus $f$ satisfies condition (5.1) with $L_j = |\partial_j f|$. From the Lipschitz assumption on
$f$ we assert that $\sum_{j=1}^{n} L_j^2(x) = \|\nabla f(x)\|^2 \le L^2$ where $\nabla f$ denotes the usual gradient
of $f$. An application of Theorem 5.1 yields that

\[
P[\exp(\lambda(f - P[f]))] \le \exp\big(\|\Gamma(2)\|_2^2 L^2\lambda^2/2\big).
\]

From similar arguments $-f$ satisfies (5.2) with $\sum_{j=1}^{n} (L_j^{(i)})^2(x) \le L^2$ and the same
estimate holds on the Laplace transform of $-f + P[f]$. Applying the classical
Chernoff arguments yields

Corollary 5.2. For any $\gamma(2)$-weakly dependent sequence on $[0,1]^n$ and any convex
$L$-Lipschitz function $f$ it holds

\[
P(|f - P[f]| \ge t) \le 2\exp\Big(-\frac{t^2}{2\|\Gamma(2)\|_2^2 L^2}\Big).
\]

This type of inequality has many applications, see [33].

From a statistical perspective, it is also interesting to investigate the properties
of the empirical process. As a corollary of Theorem 5.1 we also obtain a Poissonian
inequality for the empirical process $f(x) = \sup_{g\in\mathcal{G}}\sum_{j=1}^{n} g^2(x_j)$ for squares of real
valued Lipschitz functions. Similar results are obtained in Section 3 of Boucheron
et al.

Corollary 5.3. Assume that there exists $(\ell_i)_{1\le i\le n}$ such that for any $g \in \mathcal{G}$ we have

\[
|g(x) - g(y)| \le \sum_{i=1}^{n} \ell_i\, d(x_i, y_i), \qquad \forall x, y \in E^n,
\]

with $\sum_{i=1}^{n} \ell_i^2 = L^2 < \infty$. If $P$ satisfies $\tilde T_2(C)$ and $\tilde T_2^{(i)}(C)$ then for every $t > 0$ we
have

p(f>p[f ]+ t) < ex P (- 8C£2( ; 2 [/]+0 ), 

P(f<P[f]-t) < ex P (-^ [7I ). 
Proof. From the convex inequality $x^2 - y^2 \le 2x(x - y)$ we easily check that $f$ satisfies
(5.1) with $\sum_{j=1}^{n} L_j^2 \le 4L^2 f$ and $-f$ satisfies (5.2) with $\sum_{j=1}^{n} (L_j^{(i)})^2 \le 4L^2 f$. As $P$
satisfies $\tilde T_2(C)$ an application of (5.3) yields that for all $\lambda > 0$ we have

\[
P\big[\exp\big(f(\lambda - 4CL^2\lambda^2) - \lambda P[f]\big)\big] \le 1.
\]

An application of the Chernoff argument yields that for every $0 < \lambda < (4CL^2)^{-1}$

\[
P(f \ge P[f] + t) \le \exp\big(-t\lambda(1 - 4CL^2\lambda) + 4CL^2\lambda^2 P[f]\big).
\]

Optimizing in $\lambda$ we obtain

\[
\lambda = \frac{t}{8CL^2(t + P[f])}
\]

and the first inequality of the Corollary follows. For the second inequality, we apply
inequality (5.4) to obtain, for any $\lambda > 0$, that

\[
P[\exp(-\lambda(f - P[f]))] \le \exp\big(4CL^2\lambda^2 P[f]\big).
\]

The desired inequality follows by the Chernoff argument. □

5.3. The specific case of the Hamming distance. We fix $d(x, y) = 1_{x \ne y}$ as in
Samson [32]. Thus the results of this section hold for any $\tilde\gamma(2)$-weakly dependent
sequence (with no restriction on the margins). Using exactly the same arguments
as above, an extension of the classical exponential inequality (5.6) also holds in
this case:

\[
P(|f - P[f]| \ge t) \le 2\exp\Big(-\frac{t^2}{2\|\tilde\Gamma(2)\|_2^2 L^2}\Big)
\]

where $\tilde\Gamma(2)$ is the matrix corresponding to the coefficients $\tilde\gamma(2)$.

The case of the Hamming distance is specific because for any non negative function
$f$ we can replace the convexity argument $x^2 - y^2 \le 2x(x - y)$ by the simple
inequality $f(x) - f(y) \le f(x)1_{x \ne y}$. Let us consider the empirical process
$f(x) = |\sup_{g\in\mathcal{G}}\sum_{i=1}^{n} g(x_i)|$ for some set $\mathcal{G}$ of non negative real functions bounded
by $M$. Then $f$ satisfies (5.1) with $\sum_{j=1}^{n} L_j^2 \le Mf$ and $-f$ satisfies (5.2) with
$\sum_{j=1}^{n} (L_j^{(i)})^2 \le Mf$. Applying Theorem 5.1 we recover the results of Theorem 2 of

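Theorem 5.4. If $0 \le g \le M$ for all $g \in \mathcal{G}$ then for every $t > 0$

\[
P(f \ge P[f] + t) \le \exp\Big(-\frac{t^2}{2M\|\tilde\Gamma(2)\|_2^2(P[f] + t)}\Big),
\]
\[
P(f \le P[f] - t) \le \exp\Big(-\frac{t^2}{2M\|\tilde\Gamma(2)\|_2^2 P[f]}\Big).
\]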

The constant 1 in $\tilde T_2(1)$ or $\tilde T_2^{(i)}(1)$ is optimal in Theorem 5.4 as discussed in Boucheron
et al. [8] for the iid case. We refer the reader to this article for nice statistical
applications of this result in the iid case.
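The Chernoff optimization behind the upper tail bound for the empirical process can be made explicit. The sketch below assumes the $p = 2$ form of the exponential inequality, $P[\exp(\lambda(f - P[f]) - (C\lambda^2/2)\sum_j L_j^2)] \le 1$, together with $\sum_j L_j^2 \le Mf$; Markov's inequality then gives the exponent $-t\lambda + (CM/2)\lambda^2(t + P[f])$, which is minimized in $\lambda$. The function names and the numerical values are hypothetical and serve only as an illustration.

```python
def chernoff_exponent(lam, t, mean_f, C, M):
    """Exponent -t*lam + (C*M/2) * lam**2 * (t + mean_f) of the Chernoff bound."""
    return -t * lam + 0.5 * C * M * lam ** 2 * (t + mean_f)


def optimal_lambda(t, mean_f, C, M):
    """Minimizer of the quadratic exponent: lam* = t / (C*M*(t + mean_f))."""
    return t / (C * M * (t + mean_f))


def upper_tail_exponent(t, mean_f, C, M):
    """Value of the exponent at lam*: -t**2 / (2*C*M*(t + mean_f))."""
    return -t ** 2 / (2.0 * C * M * (t + mean_f))


if __name__ == "__main__":
    t, mean_f, C, M = 2.0, 5.0, 1.5, 3.0  # illustrative values only
    lam_star = optimal_lambda(t, mean_f, C, M)
    print(chernoff_exponent(lam_star, t, mean_f, C, M),
          upper_tail_exponent(t, mean_f, C, M))
```

The optimal $\lambda^*$ is always smaller than $1/(CM)$, so the factor $\lambda - CM\lambda^2/2$ multiplying $f$ remains positive and Markov's inequality applies.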

Due to the simple inequality $f(x) - f(y) \le f(x)1_{x \ne y}$, it is also possible to extend
the classical Bernstein inequality to the $\tilde\gamma(2)$-weakly dependent case:

Theorem 5.5 ([35] (page 460, line 7)). Let $g$ be a measurable function $\mathbb{R} \to
[-M, M]$ and let $f = \sum_{i=1}^{n} g(X_i)$.
Then for all $0 \le \lambda \le 1/(M\|\tilde\Gamma(2)\|_2^2)$ we have

\[
P[\exp(\lambda(f - P[f]))] \le \exp\Big(8\|\tilde\Gamma(2)\|_2^2\sum_{i=1}^{n} P\big[(g(X_i) - P[g(X_i)])^2\big]\lambda^2\Big).
\]

i=l 

This inequality has been applied to obtain exact oracle inequalities with fast rates
in the $\tilde\gamma(2)$-weakly dependent context in [2].

6. Applications to oracle inequalities with fast convergence rates

In this section we use the weak transport inequalities to obtain new nonexact oracle
inequalities in the $\gamma(2)$-weakly dependent setting and exact oracle inequalities in the
$\tilde\gamma(2)$-weakly dependent setting. Instead of using the extensions of classical inequalities
given in the last section we prefer a more direct approach using the PAC-Bayesian
paradigm. It allows us to consider the mathematical statistics problem of asserting
oracle inequalities as a problem of conditional mass transport.

6.1. The statistical setting. We focus on oracle inequalities for the ordinary
least squares estimator. Let us consider the case of linear regression where
$E = \mathbb{R}^{d+1}$, $X = (Y, Z) = (Y, Z^{(1)}, \ldots, Z^{(d)})$, equipped with the Euclidean norm $\|\cdot\|$.
The empirical risk is denoted

\[
r(\theta) = \frac{1}{n}\sum_{i=1}^{n} (Y_i - Z_i\theta)^2
\]

where $(X_i)_{1\le i\le n} = (Y_i, Z_i)_{1\le i\le n}$ are the observations. In our context, these
observations are not necessarily independent and we denote by $P$ their distribution.
The risk of prediction is denoted

\[
R(\theta) = P[r(\theta)], \qquad \forall\theta \in \mathbb{R}^d.
\]

The aim is to estimate the value $\bar\theta \in \mathbb{R}^d$ such that $R(\bar\theta) \le R(\theta)$, $\forall\theta \in \mathbb{R}^d$. We
consider the ordinary least squares estimator $\hat\theta$ of $\bar\theta$ such that $r(\hat\theta) \le r(\theta)$ for all
$\theta \in \mathbb{R}^d$. Let us denote by $\bar R(\theta) = R(\theta) - R(\bar\theta) \ge 0$ the excess of risk, by $\bar r$ its empirical
counterpart, by $Z = (Z_i)_{1\le i\le n}$ the $n \times d$ matrix of the design, $\|Z\|_n^2 = n^{-1}\sum_{i=1}^{n}\|Z_i\|^2$,
and by $G = P[Z^T Z]$ its corresponding Gram matrix. Assume that $G$ is a positive
definite matrix and denote $\rho = \max(1, \rho_{sp}(G^{-1}))$. All the results of this section
are given for probability measures $P$ satisfying $\tilde T_2(C)$ and $\tilde T_2^{(i)}(C)$ for some $C > 0$
on $E^n$. In view of Theorem 3.2 and for applications in time series we
are interested in $\gamma(2)$ or $\tilde\gamma(2)$ weakly dependent observations. The case of possibly
non linear autoregression is of special interest. There the vector $Z_i$ is a function
$\varphi$ of the past values of $(Y_i)$. Here $\varphi$ is known; one can think of the projection
on the last coordinates (case of linear autoregression), functions on a Fourier basis or
wavelets, etc. The regularity of the function $\varphi$ impacts the concentration properties.
The constant C in the weak transport inequality has to be estimated in each specific 
statistical case. For example, in the linear autoregressive case of order t > 1 fixed, 
we have 7(2)^0 < 7(2) nt/ei and in the non-linear autoregressive case, 7(2)^0 < 
IM|oo7(2)|-fc/f-|.o- Finally notice that 7(2) coefficients are nicely estimated for any 
bounded measurable functions $\varphi$ whereas it is not the case of $\gamma(2)$ coefficients, which
require more regularity on $\varphi$.
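The setting of this subsection can be illustrated in the simplest autoregressive case $d = 1$ with $Z_i = Y_{i-1}$. The sketch below is a toy example under stated assumptions: the empirical risk is taken as $r(\theta) = n^{-1}\sum_i (Y_i - Z_i\theta)^2$, the data are simulated from a Gaussian AR(1) model, and the function names as well as the coefficient $0.6$ are hypothetical choices that do not come from the text.

```python
import random


def empirical_risk(theta, ys, zs):
    """r(theta) = n^{-1} sum_i (Y_i - Z_i * theta)^2 for a scalar design."""
    n = len(ys)
    return sum((y - z * theta) ** 2 for y, z in zip(ys, zs)) / n


def ols_scalar(ys, zs):
    """Minimizer of the empirical risk when d = 1: sum(Y Z) / sum(Z^2)."""
    return sum(y * z for y, z in zip(ys, zs)) / sum(z * z for z in zs)


def simulate_ar1(a, n, rng):
    """Path X_0 = 0, X_{t+1} = a * X_t + xi_{t+1} with standard Gaussian noise."""
    x = [0.0]
    for _ in range(n):
        x.append(a * x[-1] + rng.gauss(0.0, 1.0))
    return x


if __name__ == "__main__":
    path = simulate_ar1(0.6, 500, random.Random(0))
    ys, zs = path[1:], path[:-1]  # regress Y_i on its own past value
    theta_hat = ols_scalar(ys, zs)
    print(theta_hat, empirical_risk(theta_hat, ys, zs))
```

Here the observations $(Y_i, Z_i)$ are dependent by construction, which is the situation the oracle inequalities of this section are designed for.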

6.2. Nonexact oracle inequality for $\gamma(2)$-weakly-dependent sequences. Our
first result is a bound on the excess of risk.

Theorem 6.1. For any measure $Q$ and any $\beta > 0$ we have

\[
Q[\bar R(\hat\theta)] \le Q[\|Z\|_n^2]/\beta + 4\sqrt{\rho C\,Q[K]\,n^{-1}\big(\mathcal{K}(P|Q) + \beta Q[\bar R(\hat\theta)]/2\big)} \tag{6.1}
\]
where



K:=^ + (l + \\W + d -^)m + (WW + + (1 + \\W)r{6). 

Proof. Considering the change of variables $(Z, \theta) \to (ZG^{-1/2}, G^{1/2}\theta)$, we assume that the
Gram matrix $G$ is the identity matrix. This change of variables is a $\rho$-Lipschitz
function. Thus $ZG^{-1/2}$ satisfies $\tilde T_2(\rho C)$ and $\tilde T_2^{(i)}(\rho C)$ using arguments similar to those of
Lemma 2.1 in [12]. Thus in the sequel $G = I_d$ and $Z \in \tilde T_2(\rho C)$ and $\tilde T_2^{(i)}(\rho C)$. With
this notation, $P[\|Z\|_n^2] = d$ and $\|\theta - \bar\theta\|^2 = R(\theta) - R(\bar\theta)$. We adopt the
so-called PAC-Bayesian approach considering that $\hat\theta = \rho_{\hat\theta}[\theta]$ where $\rho_{\hat\theta} = \mathcal{N}_d(\hat\theta, \beta^{-1}I_d)$
for any $\beta > 0$. This probability measure is measurable with respect to the
observations $(X_i)$. Thus, the properties of the measure $P\rho_{\hat\theta}$ are not simple to handle
directly. The PAC-Bayesian approach consists in introducing artificially the measure
$\rho_{\bar\theta}$, called a priori because it does not depend on the observations $(X_i)$. Let us fix
some measure $Q$ and denote by $Q_\theta$ the probability measure such that $\rho_{\bar\theta}Q_\theta = Q\rho_{\hat\theta}$.



Let us first study properties similar to (5.1) for the function $f = r$. With
some abuse of notation, the Euclidean norm on any vector space will also be denoted $\|\cdot\|$. Using
the inequality $x^2 - y^2 \le 2x(x - y) \le 2|x||x - y|$ for any $x, y \in \mathbb{R}$ we obtain

1 " 

f(x) - f(x') < - - Zl 9) 2 - (y[ - z>9) 2 + (y[ - z[9) 2 - (y t - Zi 9) 2 ) 



n 

1=1 



< 



2 



£(|l/ 4 - 0)1111^ - + \y\ - ^|(||(1, - 4||). 



n 

i=i 



Then by definition of $\tilde W_2$ and using the Cauchy-Schwarz inequality we obtain,
conditionally on $\theta$, that



P[f] - Qe{f] < 2|| (1, 9)\\ ^n^R(9)W 2 (Q 8l P) + 2|| (1, 9)\\ n^Qg[r(9)]W 2 (P Qg) 

As $P$ satisfies $\tilde T_2(\rho C)$ and $\tilde T_2^{(i)}(\rho C)$ and using the Cauchy-Schwarz inequality we
obtain



Qe[P[f] - /] < ^ pCn-^JC(P\Q e )((l + \\9W)R(9) + (1 + ||0|| 2 )Q e [r(0)]). 
The positivity of the integrand with respect to pj yields 

PeQe[P[f] - f] <A n [^ pCn-^K,(P\Q e )((l + \\9\\ 2 )R(9) + (1 + \\9\\ 2 )Qg[r(9)]) 

<4^pCn-^pj[IC(P\Qe)}(Pe[(l + \\0\\ 2 )R(9)] + (1 + ||0|| 2 )Q[r(0)]). 

Notice that by definition PgQg = Q/5g such that we have pg[K(P\Qg) = JC(P\Q) + 
Q[JC(pg\pg)]. Moreover K(p s \pg)\ < /3/2(R(§) - R(9)) so that we obtain 

Qp § [R(9)-R(9)-r(9)+r(9)} < 

4^fpCn-i(IC(P\Q') + P/2Q[R(9) - R(9)])Pe\^ + \\0)\\ 2 )R(0)} + (1 + \\B\\ 2 )Q[r(§)])- 

Now, by Jensen's inequality Qps[R(6)] > Q[R(9)] and computations gives that 
Qp § [r(9)} < r(9) + Q[\\Z\\ 2 n }/ (3 < r(9) + Q[\\Z\\ 2 n }/ (3. Collecting those bounds, we 
obtain 

Q[R(9)-R(9)-\\Z\\ 2 n ]/f5] < 

P Cn-\K(P\Q) + (3/2Q[R(9) - R(9)})pgl(± + \\0)\\ 2 )R(9)} + (1 + \\9\\ 2 )Q[r(9)})- 
To end the proof, let us compute $\rho_{\bar\theta}[(1 + \|\theta\|^2)\bar R(\theta)]$ using the following identity

+ \\e\\ 2 )R(9)} = n [R{e)\ + n [\\ef]R{e) + n [\\efR{e) - r(5)}. 

Let us decompose the last term: 

nl\\8\\ 2 R(.o) - R(o)} = PeMWW] + 2n- 1 p[yz}^[\\e\\ 2 (e - e)\ 

where y = (Y\, . . . , Y n ). Simple computations on gaussian random variables give 

PS[R(9)] = RQS) + d/13 

Pel\\0\\ 2 } = \\W + d/f3 

p w [\\9\\ 2 (9-9)]=29/f3 

PelWOfWO W] = (WW + d/m l)//3 + \\9\\ 2 IP + ^/ fi. 

The desired result follows collecting all these bounds and noticing that AP[yZ]9 < 
2nR(9). a 

In the proof above, we obtain the more general result: for any probability
measures $\mu$ and $\nu$ such that there exists $Q_\theta$ satisfying $Q\mu = \nu Q_\theta$ we have:
(6-2) 

QH\R\ < + ^ P Cn-^1C{Pv\Qy)(v[{l + ||0|| 2 )i?(0)] + (1 + \W)Q[r{6)\. 

This bound is obtained by integrating with respect to $\nu$ the conditional mass
transport of $Q_\theta \circ r(\theta)^{-1}$ to $P \circ r(\theta)^{-1}$. The weak transport inequalities satisfied by $P$
and the convexity properties of the function $(x_1, \ldots, x_n) \mapsto r(\theta)$ are used to obtain a
bound conditionally on $\theta$.

Let us discuss the choices $\mu = \rho_{\hat\theta}$ and $\nu = \rho_{\bar\theta}$ made above. Notice that $\mu$
and $\nu$ have the same support from the assumption $Q\mu = \nu Q_\theta$. As soon as $\mu$
is centered in $\hat\theta$, Jensen's inequality yields $Q\mu[\bar R(\theta)] \ge Q[\bar R(\hat\theta)]$. Next, if $\mu$ is
sufficiently concentrated around $\hat\theta$ then $Q\mu[r(\theta) - r(\bar\theta)]$ is small as $r(\hat\theta) - r(\bar\theta) \le 0$.
Choosing $\mu$ as the Dirac mass in $\hat\theta$ is excluded because it would prevent the existence of a measure
$Q_\theta$ satisfying $\nu Q_\theta = Q\mu$. The fact that the support of $\mu$ cannot depend on the
observations $(X_i)$ constrains us to choose measures supported on the whole space
$\mathbb{R}^d$ in absence of a priori information on $\bar\theta$. The term $Q\mu[r(\theta) - r(\hat\theta)]$ can be seen as
an alternative to the classical VC-dimension, see McAllester [29]. The measure $\mu$
should be chosen in order to bound this term (and the entropy $\mathcal{K}(\nu|\mu)$). It leads to
Gibbs estimators that are nice alternatives to classical estimators, see Chapter 4 of
the textbook of Catoni [TD] in the iid case, Alquier and Wintenberger [21 [2] in weakly 
dependent settings. Here we choose the Gaussian measures $\mu = \rho_{\hat\theta}$ and $\nu = \rho_{\bar\theta}$ as
in Audibert and Catoni [4] for simplicity because we have the explicit computation
$\mathcal{K}(\nu|\mu) = \beta/2\,\|\hat\theta - \bar\theta\|^2$. This choice leads to estimate the term $Q\mu[r(\theta) - r(\hat\theta)]$ by
$Q[\|Z\|_n^2]/\beta$. This term can easily be estimated with $d/\beta$ and a concentration term
involving the entropy $\mathcal{K}(P|Q)$ in order to obtain a nonexact oracle inequality:

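Corollary 6.2. For any $0 < \varepsilon < 1$ and any $(d + 2)/n \le \eta \le 1$ we have with
probability $1 - \varepsilon$:

\[
R(\hat\theta) \le (1 + B_1\eta)R(\bar\theta) + \frac{B_2 d + 16\rho C\log(\varepsilon^{-1})}{n\eta} + \frac{B_3}{(n\eta)^2}
\]

where

\[
B_1 = 2(3 + 2\|\bar\theta\|^2 + \eta/n), \qquad
B_2 = 2(5 + \|\bar\theta\|^2), \qquad
B_3 = 2(d(d-1) + d/n).
\]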
Remark 6.1. This result extends nonexact oracle inequalities as developed by Lecué
and Mendelson [3T] to a dependent context but for the OLS only (and not also 
regularized estimators). 

Proof. As for any $a, b > 0$ we have $2\sqrt{ab} \le a\lambda + b/\lambda$ for any $\lambda > 0$, from (6.1)
we obtain

Q[R(9) - \\Z\\ 2 /(3 - KX/n - 0R(6)/(2X)} - A P CK ^) < . 

A 

Notice that by definition of K we have 

Q[K] = 4| + (i + p| 2 + + (\\W + + (i + \\e\\ 2 )Q[r(o)} 

By arguments similar to those in the proof of Theorem 6.1 we have



Q[r(9)}-R{9) < 2\j2pCR{9)n~ 1 K,{P\Q). 
Similarly, as P[||Z|| 2 ] = d we obtain 



Q[\\Z\\ 2 ] -d< 2 v / 2pCdn- 1 IC(P\Q). 
Collecting those bounds and using the Cauchy-Schwarz inequality we obtain
Q[\\Z\\ 2 /p + X/nr(e)] <d/p + X/nR(9) 

+ 4 v /pCn- 1 (d//3 2 + (X/n) 2 R(9))JC(P\Q). 



Using again that $2\sqrt{ab} \le a\lambda + b/\lambda$, choosing $\beta = \lambda = n\eta$ and by definition of $B_1$,
$B_2$ and $B_3$ we have

Q[R(9) - B lV R(9) - B 2 d/(n V ) - B 3 /(n V ) 2 ] < l^Sglg) , 

ni] 

Choose $Q$ as the probability $P$ restricted to the complement of the event
corresponding to the desired oracle inequality, this event being denoted $A$. Then

lepciog^- 1 ) ^ ^- f ^ D „ D ^ DJ „_ N D „ , a . 



nq 



< Q[R{6) - B lV R(9) - B 2 d/{ni 1 ) - B 3 /(nr)y 



Combining these two inequalities we assert that for this specific $Q$ we have $-\log(\varepsilon) \le
\mathcal{K}(P|Q)$. The relative entropy can be computed explicitly, $\mathcal{K}(P|Q) = -\log(1 - P(A))$,
and thus the desired result follows. □
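The explicit entropy computation used at the end of the proof can be checked on a finite probability space. The sketch below is an illustration under stated assumptions: the restricted measure is normalized to a probability, the entropy is computed as $\sum_x q(x)\log(q(x)/p(x))$, and the four-point law and the function names are hypothetical.

```python
import math


def restrict_to_complement(p, event):
    """Normalised restriction of the discrete law p to the complement of `event`."""
    mass = sum(w for x, w in p.items() if x not in event)
    return {x: w / mass for x, w in p.items() if x not in event}


def relative_entropy(q, p):
    """sum_x q(x) * log(q(x) / p(x)) for discrete laws with q << p."""
    return sum(w * math.log(w / p[x]) for x, w in q.items() if w > 0.0)


if __name__ == "__main__":
    p = {0: 0.1, 1: 0.2, 2: 0.3, 3: 0.4}  # hypothetical law P
    event = {0, 1}                        # event A with P(A) = 0.3
    q = restrict_to_complement(p, event)
    print(relative_entropy(q, p), -math.log(1.0 - 0.3))
```

Since the density of the restricted measure with respect to $P$ is constant and equal to $1/(1 - P(A))$ on the complement of $A$, the entropy reduces to $-\log(1 - P(A))$, which is what turns the entropy bound into a probability bound.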

6.3. Exact oracle inequality for $\tilde\gamma(2)$-weakly-dependent sequences. Let us
now give an equivalent of (6.2) when we equip $E$ with the Hamming distance
$d(x, y) = 1_{x \ne y}$. Instead of using the convexity of $x \mapsto x^2$ as above, we use that
$f(x) - f(y) \le |f(x)|1_{x \ne y} + |f(y)|1_{x \ne y}$ for any $f$. Following the lines of the proof
above with $f = \bar r$ we obtain easily

\[
Q\mu[\bar R] \le Q\mu[\bar r] + 2\sqrt{2\rho C\,\mathcal{K}(P\nu|Q\mu)\big(P\nu[\bar r^2] + Q\mu[\bar r^2]\big)}. \tag{6.3}
\]

For the specific choice p, = Pg and v = pj we use computations given in Lemma 1.2 
in the supplementary material of [3] stating that for any 9 6 H d 

np np z 

where \\Z\\ 4 n = nT 1 YJLi W 2 ^- The quantities Q[\\Z\\ 2 n r{9)\ and Q[||2||£] can be 
difficult to estimate for desired choices of Q. Let us work under the following 
assumption on the set of parameters $\Theta \subset \mathbb{R}^d$ containing the support of $P$ and the
unit disc: there exists some finite constant $B > 0$ such that

(6.4) B - sup =5 — 

eee 2^i=i P[^m 

A similar assumption has been used in the iid case by Audibert and Catoni in [3].
Under (6.4) and the fact that we assume $P[\|Z\|^2] = d$ with no loss of generality (see the
discussion in the proof above) we have $\|Z\|_n^2 \le Bd$ and $\|Z\|_n^4 \le (Bd)^2$. Moreover,
using computations given in the supplementary material of [3] we obtain easily that



r 



(0) < n~ L (2B z + 8Br(0))R(9). 



It leads to the following equivalent of Theorem 6.1.
Theorem 6.3. If condition (6.4) holds, we have



nd i 

Q\R0)] < — +2^2pCn- l {K{P\Q) + f3Q{R0)]/2)x 

\Jq[(10B 2 + A0Br{9))R(9)] + ABd(R(9) + Q[r(9)})/(3 + 8(Bd/(3) 2 . 

In the above estimate the terms involving r(0) are nuisance terms because there 
is no control on 9. If this term is bounded then the main term multiplying the 
entropy is proportional to the excess risk Q[R(9)]. It is the major advantage con- 
sidering the Hamming distance compared with the Euclidian distance where instead 
Q[R(9)] appeared. In the classical approach as developed by Massart in [5S], the 
excess risk also appears via the variance term in Bernstein's inequality under the 
margin assumption of Tsybakov |35| that estimates this variance term by R(9). 
As Q[R(9)\ is the quantity of interest, we can obtain exact oracle inequality the 
following corollary 

Corollary 6.4. For any $0 < \varepsilon < 1$ and any $M > 0$ we have with probability $1 - \varepsilon$:

me) < me) + i6o B ' + 4BM x 



d{R{9) + M) : 8(Bd) 2 
10B + 40Af 

Remark 6.2. As already noticed by Audibert and Catoni in [5] in the iid case,
the exact oracle inequality holds for $\tilde\gamma(2)$-weakly dependent sequences without any
assumption on the margins of $P$ except (6.4) (because the support of any probability measure has
a diameter bounded by 1 for the Hamming distance $d$). We refer the reader
to [3] for a nice way to bound the term $\log P(r(\bar\theta) > M)$ in the iid case under a finite
moment assumption on $P$ of order 4 only.

Proof. Let us denote $A = \{r(\bar\theta) \le M\}$ and $P_A$ the restriction of $P$ to $A$ defined
as $P_A(B) = P(B \cap A)$ for any measurable set $B$ of $E^n$. We do not know whether
$P_A$ satisfies any weak transport inequality. However, a reasoning similar to the one used for
obtaining (6.3) yields



Q[R{9)\ < Bd/(3 + P§ U{ABdR{9)/l3 + (ABd/p) 2 )n-^W 2 {Q e ,PA) 



+ yj {(10B 2 + 40BM)Q[R(9)} + ABdM/(3 + (4Bd//3) 2 )n- 1 I^ 2 (P j4 , Q e ) 
Now let us use the triangular inequality of the weak transport cost:

\[
\tilde W_2(P_A, Q_\theta) \le \tilde W_2(P_A, P) + \tilde W_2(P, Q_\theta),
\]
\[
\tilde W_2(Q_\theta, P_A) \le \tilde W_2(Q_\theta, P) + \tilde W_2(P, P_A).
\]
Because $P$ satisfies $\tilde T_2(\rho C)$ and $\tilde T_2^{(i)}(\rho C)$, both right-hand side terms are estimated by



^2 P CH{P A \P) + y/2pCJC(P\Q e ) < ^ P C{K{P\Q e ) - logP(A)) 
Collecting all these bounds and using the Cauchy-Schwarz inequality we obtain



Q[R(B)} < Bd/P + i\J 2pCn- 1 (IC(P\Q) + PQ[R(0)]/2 - log P(A))x 

\J ((KL9 2 + 40BM)Q[R(9)} + ABd{R{9) + M)/(3 + 8{Bd/f3) 2 ). 



Using several times the inequality 2\fab < aX + b/X with A = (3 = n(40-B 2 + 
WOBM) -1 yields 

m/i < „*±™ ( M+ ifC(K(m _ log P(A)) + + «!) 

n V \\)B + 40m n / 

We conclude as in the proof of Corollary 6.2, choosing $Q$ as $P$ restricted to the
complement of the event corresponding to the desired oracle inequality. □